# Time-weighted vector store retriever

This retriever uses a combination of semantic similarity and a time decay.

Each document is scored as:
`semantic_similarity + (1.0 - decay_rate) ^ hours_passed`

Notably, `hours_passed` is the number of hours since the object in the retriever was last accessed, not since it was created. This means that frequently accessed objects remain "fresh".
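
As a rough illustration of the formula, the sketch below computes the combined score in plain Python. The function name `combined_score` and the sample values (a similarity of 0.8, a decay rate of 0.01) are assumptions made for this example; they are not part of LangChain's API.

```python
def combined_score(semantic_similarity: float, decay_rate: float, hours_passed: float) -> float:
    """Return semantic_similarity + (1.0 - decay_rate) ^ hours_passed."""
    recency = (1.0 - decay_rate) ** hours_passed
    return semantic_similarity + recency


# Two documents with the same similarity but different time since last access.
fresh = combined_score(0.8, decay_rate=0.01, hours_passed=0)
stale = combined_score(0.8, decay_rate=0.01, hours_passed=24)
print(fresh, stale)
```

The document accessed a moment ago gets the full recency bonus of 1.0, while the one last accessed 24 hours ago gets a smaller bonus, so it scores lower despite the identical similarity.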



In [1]:
import os

from dotenv import load_dotenv
from langchain.llms import OpenAI

# Read credentials from a local .env file instead of hard-coding them
load_dotenv()

api_key = os.getenv("OPEN_AI_KEY")
organization = os.getenv("OPEN_AI_ORG")

# temperature=0 keeps completions as deterministic as possible
llm = OpenAI(
    temperature=0,
    openai_api_key=api_key,
    openai_organization=organization,
)

- A `retriever` is an interface that returns documents given an unstructured query. It is more general than a `vector store`. 
- A `retriever` does not need to be able to store documents, only to return (or retrieve) them. 
- `Vector stores` can be used as the backbone of a retriever, but there are other types of retrievers as well.
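
To make the distinction concrete, here is a minimal sketch of a retriever that is not backed by a vector store and does not store anything itself. `Doc` and `KeywordRetriever` are hypothetical stand-ins written for this example, not LangChain classes; the only contract illustrated is "query in, documents out".

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Returns documents sharing at least one word with the query.

    It holds no documents of its own; it pulls them from `load_docs` on every call.
    """

    def __init__(self, load_docs: Callable[[], Iterable[Doc]]):
        self._load_docs = load_docs

    def get_relevant_documents(self, query: str) -> List[Doc]:
        query_words = set(query.lower().split())
        return [
            doc
            for doc in self._load_docs()
            if query_words & set(doc.page_content.lower().split())
        ]


def load_docs() -> List[Doc]:
    # Stand-in for any external source (a file, a search API, a database).
    return [Doc("hello world"), Doc("hello foo"), Doc("goodbye")]


keyword_retriever = KeywordRetriever(load_docs)
hits = keyword_retriever.get_relevant_documents("hello there")
print([doc.page_content for doc in hits])  # prints ['hello world', 'hello foo']
```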

In [2]:
import faiss
from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

# Define your embedding model
embeddings_model = OpenAIEmbeddings(
    openai_api_key=api_key,
    openai_organization=organization,
)

## Low decay rate
A low decay rate (in this example, to be extreme, we set it close to 0) means memories will be "remembered" for longer. A decay rate of 0 means memories are never forgotten, making this retriever equivalent to a plain vector lookup.
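
A quick arithmetic check of this claim, using the recency term from the scoring formula at the top of the notebook. The value `1e-25` and the two time gaps (one day, ten years) are illustrative assumptions.

```python
decay_rate = 1e-25

# In double precision, 1.0 - 1e-25 rounds to exactly 1.0,
# so the recency term does not shrink however much time passes.
recency_after_one_day = (1.0 - decay_rate) ** 24
recency_after_ten_years = (1.0 - decay_rate) ** (24 * 365 * 10)
print(recency_after_one_day, recency_after_ten_years)  # prints 1.0 1.0
```

Because every document receives the same recency bonus, the ranking is decided by semantic similarity alone.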

In [4]:
# Initialize the vectorstore as empty
embedding_size = 1536  # vector dimension of text-embedding-ada-002
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(
    embedding_function=embeddings_model.embed_query,
    index=index,
    docstore=InMemoryDocstore({}),
    index_to_docstore_id={},
)
# k=1 returns only the single highest-scoring document
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore,
    decay_rate=.0000000000000000000000001,
    k=1,
)

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


In [53]:
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])

['c00bf3e9-d340-4292-9e43-22c92c4c0d2b']

In [54]:
# "hello world" is returned first because it is the most salient, and with a decay rate close to 0 it still counts as recent
retriever.get_relevant_documents("hello world")

[Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 24, 59, 630256), 'created_at': datetime.datetime(2023, 12, 13, 13, 24, 57, 943449), 'buffer_idx': 0})]

## High decay rate
With a high decay rate (e.g., several 9's), the recency score quickly goes to 0! If you set this all the way to 1, recency is 0 for all objects, once again making this equivalent to a vector lookup.
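
The effect on ranking can be sketched with the scoring formula and made-up similarity values. The similarities (0.95 for the exact match, 0.80 for the near match) and the elapsed times are assumptions for illustration, not measured scores.

```python
def score(similarity: float, decay_rate: float, hours_passed: float) -> float:
    return similarity + (1.0 - decay_rate) ** hours_passed


# (text, assumed similarity to the query "hello world", hours since last access)
candidates = [
    ("hello world", 0.95, 24.0),  # last accessed yesterday
    ("hello foo", 0.80, 0.01),    # added a few seconds ago
]


def top_result(decay_rate: float) -> str:
    best = max(candidates, key=lambda c: score(c[1], decay_rate, c[2]))
    return best[0]


print(top_result(0.999))  # prints hello foo
print(top_result(1.0))    # prints hello world
```

At a decay rate of 0.999 the day-old document has lost essentially all of its recency bonus, so the fresher but less similar document wins. At exactly 1 both bonuses are 0 and the order falls back to similarity.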

In [16]:
# Initialize the vectorstore as empty
embedding_size = 1536
# Index that stores the full vectors and performs exhaustive search
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(
    embedding_function=embeddings_model.embed_query,
    index=index,
    docstore=InMemoryDocstore({}),
    index_to_docstore_id={},
)
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vectorstore,
    decay_rate=.999,
    k=1,
)

`embedding_function` is expected to be an Embeddings object, support for passing in a function will soon be removed.


In [17]:
yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])

['01bf581e-3398-43cd-b13f-228ef0094ed7']

In [31]:
new_retriever = retriever.vectorstore.as_retriever()
new_retriever.vectorstore.__dict__

{'embedding_function': OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x11ead8bd0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x11eafa690>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization='<redacted>', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, http_client=None),
 'index': <faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x11eccdef0> >,
 'docstore': <langchain.docstore.in_memory.InMemoryDocstore at 0x1203279d0>,
 'index_to_docstore_id': {0: 'd850435f-d58a-4380-af8a-b08a

In [75]:
# "Hello Foo" is returned first because "hello world" is mostly forgotten
retriever.get_relevant_documents("hello world")

[Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 25, 43, 249315), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 42, 891064), 'buffer_idx': 2})]

## Virtual time
Using some utils in LangChain, you can mock out the time component.
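
For readers unfamiliar with the idea, the general technique of freezing "now" can be sketched with the standard library alone. This is not LangChain's `mock_now` implementation; `frozen_now` is a hypothetical helper written for this example.

```python
import contextlib
import datetime as dt


@contextlib.contextmanager
def frozen_now(fixed: dt.datetime):
    """Temporarily make dt.datetime.now() return `fixed` (hypothetical helper)."""
    real_datetime = dt.datetime

    class FrozenDatetime(real_datetime):
        @classmethod
        def now(cls, tz=None):
            return fixed

    # Swap the class on the module, and always restore it afterwards.
    dt.datetime = FrozenDatetime
    try:
        yield
    finally:
        dt.datetime = real_datetime


with frozen_now(dt.datetime(2023, 12, 13, 13, 11)):
    inside = dt.datetime.now()   # the frozen timestamp
outside = dt.datetime.now()      # the real clock again
print(inside, outside)
```

One caveat of this approach: only code that looks up `datetime.datetime` through the module at call time sees the frozen clock. Code that ran `from datetime import datetime` beforehand keeps its reference to the real class.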

In [64]:
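# from langchain.utils import mock_now
# # `datetime` already refers to the class imported via `from datetime import datetime`,
# # so a bare `import datetime` here would shadow it and break later calls such as datetime.now().
# # Notice the last access time is that date time
# with mock_now(datetime(2023, 12, 13, 13, 11)):
#     print(retriever.get_relevant_documents("hello world"))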

In [76]:
retriever.memory_stream

[Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 25, 36, 857100), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 34, 23614), 'buffer_idx': 0}),
 Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 12, 12, 13, 25, 42, 168102), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 42, 168199), 'buffer_idx': 1}),
 Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 25, 43, 249315), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 42, 891064), 'buffer_idx': 2})]

## Extract Vectorstore from retriever

In [102]:
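import os

vectordb_path = 'vectordb_path'
db_file_name = 'test'

# save_local persists the FAISS index and docstore only;
# the retriever's memory_stream is not part of the vector store.
retriever.vectorstore.save_local(
    folder_path=os.path.join(vectordb_path, db_file_name),
    index_name='index',  # default
)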

In [104]:
vectordb_path = 'vectordb_path'
db_file_name = 'test'

vector = FAISS.load_local(
    folder_path = os.path.join(vectordb_path, db_file_name),
    embeddings  = embeddings_model,
    index_name = "index",
)

retriever = vector.as_retriever()

In [105]:
retriever.get_relevant_documents("hello world")

[Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 12, 12, 13, 25, 42, 168102), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 42, 168199), 'buffer_idx': 1}),
 Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 25, 36, 857100), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 34, 23614), 'buffer_idx': 0}),
 Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 25, 43, 249315), 'created_at': datetime.datetime(2023, 12, 13, 13, 25, 42, 891064), 'buffer_idx': 2})]

In [106]:
retriever = TimeWeightedVectorStoreRetriever(
    vectorstore=vector, 
    decay_rate=.999, 
    k=1
)

In [107]:
vectorstore.similarity_search('',k=100)

[Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 12, 12, 13, 36, 39, 213944), 'created_at': datetime.datetime(2023, 12, 13, 13, 36, 39, 215179), 'buffer_idx': 0}),
 Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 12, 13, 13, 36, 40, 78398), 'created_at': datetime.datetime(2023, 12, 13, 13, 36, 40, 78398), 'buffer_idx': 1})]

In [108]:
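# add_documents expects a list of Documents. Passing a single Document makes it
# iterate over the document's (field, value) tuples, which raises
# AttributeError: 'tuple' object has no attribute 'metadata'.
retriever.add_documents(vectorstore.similarity_search('', k=100))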