# Similarity Search with time-based filtering using LlamaIndex

A key use case for Timescale Vector is efficient time-based vector search. Timescale Vector enables this by automatically partitioning vectors (and associated metadata) by time. This allows you to efficiently query vectors by both similarity to a query vector and time because we can exclude entire partitions that don't overlap with the query time.

Time-based vector search functionality is helpful for applications like:
- Storing and retrieving LLM response history (e.g. chatbots)
- Finding the most recent embeddings that are similar to a query vector (e.g. recent news)
- Constraining similarity search to a relevant time range (e.g. asking time-based questions about a knowledge base)

To illustrate how to use TimescaleVector's time-based vector search functionality, we'll ask questions about the git log history for TimescaleDB. We'll show how to add documents with a time-based UUID and how to run similarity searches with time range filters.

## Set up your environment

First, install the necessary prerequisites.

In [None]:
# Pip install necessary packages
%pip install -U timescale-vector
%pip install -U openai
%pip install -U llama_index
%pip install -U llama-index-readers-json
%pip install -U llama-index-embeddings-openai
%pip install -U llama-index-vector-stores-timescalevector
%pip install -U python-dotenv

Next, set up your secrets. You'll need an API key from [OpenAI](https://platform.openai.com/) and a service URL from [Timescale](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=aicookbooks&utm_medium=referral). For security, we suggest storing these in a dotenv file.

In [None]:
import os

# Get openAI api key by reading local .env file
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv(), override=True)
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
TIMESCALE_SERVICE_URL = os.environ["TIMESCALE_SERVICE_URL"]

## Extract content and metadata from git log CSV
First, let's prepare the git log data so we can load it into a new table in our PostgreSQL database named `li_commit_history`.

You'll need to [download the sample dataset](https://s3.amazonaws.com/assets.timescale.com/ai/commit_history.csv) and place it in the same directory as this notebook.

You can use the following command:

In [None]:
# Download the file using curl and save it as commit_history.csv
# Note: the leading "!" runs the command from the notebook; to run it in a terminal instead,
# drop the "!" and execute it in the same directory as the notebook
!curl -O "https://s3.amazonaws.com/assets.timescale.com/ai/commit_history.csv"

Next, let's create a set of functions to extract metadata from the git commits. Note how we create the UUID based on the timestamp.

In [None]:
from timescale_vector import client
from datetime import datetime

def parse_date(date_string: str) -> datetime:
    if date_string is None:
        return None
    time_format = "%a %b %d %H:%M:%S %Y %z"
    return datetime.strptime(date_string, time_format)

# Take a date string in the past and return a UUID v1 (time-based) as a string
def create_uuid(date_string: str):
    datetime_obj = parse_date(date_string)
    time_uuid = client.uuid_from_time(datetime_obj)
    return str(time_uuid)

def split_name(input_string: str):
    if input_string is None:
        return None, None
    start = input_string.find("<")
    end = input_string.find(">")
    name = input_string[:start].strip()
    email = input_string[start + 1 : end].strip()
    return name, email
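
To make the idea of a time-based UUID concrete, here is a minimal standard-library sketch of how a version 1 UUID can carry a timestamp and how that timestamp can be read back. The helper names `uuid1_from_datetime` and `datetime_from_uuid1` are hypothetical and written only for illustration; this is not the implementation of `client.uuid_from_time`, and the clock sequence and node fields are fixed placeholder values.

```python
import uuid
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_from_datetime(dt: datetime) -> uuid.UUID:
    """Hypothetical helper: build a version 1 UUID whose time field encodes dt.

    dt must be timezone-aware. Clock sequence and node are placeholders (zeros).
    """
    delta = dt - UNIX_EPOCH
    # Integer arithmetic avoids floating-point rounding of the timestamp.
    unix_100ns = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    ts = unix_100ns + UUID_EPOCH_OFFSET
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi = (ts >> 48) & 0x0FFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi, 0x80, 0x00, 0), version=1)


def datetime_from_uuid1(u: uuid.UUID) -> datetime:
    """Hypothetical helper: recover the timestamp stored in a version 1 UUID."""
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return UNIX_EPOCH + timedelta(microseconds=unix_100ns // 10)


commit_time = datetime(2023, 8, 1, 22, 10, 35, tzinfo=timezone.utc)
commit_uuid = uuid1_from_datetime(commit_time)
print(commit_uuid, commit_uuid.version)
print(datetime_from_uuid1(commit_uuid) == commit_time)
```

Because the timestamp lives inside the identifier, a store can derive each row's time from its ID alone, which is what makes partitioning by time possible without a separate timestamp column.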

Then, we load the data.

In [None]:
import pandas as pd
from pathlib import Path

from llama_index.core.schema import TextNode
# Create a Node object from a single row of data
def create_node(row):
   record = row.to_dict()
   (record_name, _)= split_name(record["author"])
   record_content = str(record["date"]) + " " + record_name + " " + str(record["change summary"]) + " " + str(record["change details"])
   node = TextNode(
       id_=create_uuid(record["date"]),
       text= record_content,
       metadata={
           'commit': record["commit"],
           'author': record_name,
           'date': str(parse_date(record["date"])),
       }
   )
   return node

# Read the CSV file into a DataFrame
file_path = Path("commit_history.csv")
df = pd.read_csv(file_path)
df = df.dropna()
nodes = [create_node(row) for _, row in df.iterrows()]

NUM_RECORDS = 20
nodes = nodes[:NUM_RECORDS]

Since this is a demo, we only keep the first `NUM_RECORDS` records (20 in the cell above). In practice, you can load as many records as you want.

Next, we will create embeddings.

In [None]:
# Create embeddings for nodes
from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding(model="text-embedding-3-small")

for node in nodes:
   node_embedding = embedding_model.get_text_embedding(
       node.get_content(metadata_mode="all")
   )
   node.embedding = node_embedding

## Load documents and metadata into TimescaleVector vectorstore
Now that we have prepared our documents, let's process them and load them, along with their vector embedding representations, into Timescale Vector.

First, we'll define a table name for the data.

We'll also define a time delta and pass it to the `time_partition_interval` argument, which sets the interval used to partition the data by time. Each partition holds data for the specified length of time. We'll use 7 days for simplicity, but you can pick whatever value makes sense for your use case. For example, if you query recent vectors frequently, you might use a smaller time delta such as 1 day; if you query vectors over a decade-long time period, you might use a larger one such as 6 months or 1 year.
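
As a rough way to reason about this trade-off, the sketch below counts how many fixed-width partitions a query window would overlap for different interval sizes. It assumes partitions are aligned to the Unix epoch, which is an illustrative choice and not necessarily how Timescale Vector aligns its partitions; `partition_index` and `partitions_touched` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

# Assumed alignment origin for the partitions (illustrative only).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_index(ts: datetime, interval: timedelta) -> int:
    """Index of the fixed-width partition that contains ts."""
    return (ts - EPOCH) // interval  # timedelta // timedelta yields an int


def partitions_touched(start: datetime, end: datetime, interval: timedelta) -> int:
    """Number of partitions that overlap the window [start, end]."""
    return partition_index(end, interval) - partition_index(start, interval) + 1


window_start = datetime(2023, 8, 1, 22, 10, 35, tzinfo=timezone.utc)
window_end = datetime(2023, 8, 30, 22, 10, 35, tzinfo=timezone.utc)

for days in (1, 7, 180):
    count = partitions_touched(window_start, window_end, timedelta(days=days))
    print(f"interval={days} days -> {count} partition(s) overlap the query window")
```

Smaller intervals let a narrow query skip more data but produce more partitions overall; larger intervals do the opposite.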

In [None]:
from llama_index.vector_stores.timescalevector import TimescaleVectorStore
from datetime import timedelta

# Create a timescale vector store and add the newly created nodes to it
ts_vector_store = TimescaleVectorStore.from_params(
   service_url=TIMESCALE_SERVICE_URL,
   table_name="li_commit_history",
   time_partition_interval= timedelta(days=7),
)
ts_vector_store.add(nodes)

Create an index to speed up queries. Using the `create_index()` function without additional arguments will create a timescale_vector_index by default, using the default parameters.

In [None]:
# create an index
# by default this will create a Timescale Vector (DiskANN) index
ts_vector_store.create_index()

## Querying vectors by time and similarity

Now that we have loaded our documents into TimescaleVector, we can query them by time and similarity.

TimescaleVector does this efficiently because any partition (a 7-day period in this example) that does not overlap with the query time range is completely excluded.
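
The idea of partition exclusion can be sketched with a toy in-memory store. This is a conceptual illustration only: `ToyTimePartitionedStore` is a hypothetical class, the epoch-aligned 7-day buckets and the use of cosine similarity are assumptions, and none of it reflects how Timescale Vector is implemented internally.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed bucket alignment
INTERVAL = timedelta(days=7)


def bucket(ts: datetime) -> int:
    """Index of the 7-day partition that contains ts."""
    return (ts - EPOCH) // INTERVAL


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


class ToyTimePartitionedStore:
    """Hypothetical store that groups rows by time bucket."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def add(self, ts, vector, text):
        self.partitions[bucket(ts)].append((ts, vector, text))

    def query(self, query_vector, start, end, top_k=5):
        first, last = bucket(start), bucket(end)
        scanned = 0
        hits = []
        for index, rows in self.partitions.items():
            if index < first or index > last:
                continue  # whole partition skipped without reading its rows
            scanned += 1
            for ts, vector, text in rows:
                if start <= ts <= end:
                    hits.append((cosine_similarity(query_vector, vector), text))
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return hits[:top_k], scanned


store = ToyTimePartitionedStore()
store.add(datetime(2023, 1, 15, tzinfo=timezone.utc), [1.0, 0.0], "january commit")
store.add(datetime(2023, 8, 10, tzinfo=timezone.utc), [1.0, 0.0], "august commit A")
store.add(datetime(2023, 8, 20, tzinfo=timezone.utc), [1.0, 1.0], "august commit B")

results, scanned = store.query(
    [1.0, 0.0],
    start=datetime(2023, 8, 1, tzinfo=timezone.utc),
    end=datetime(2023, 8, 30, tzinfo=timezone.utc),
)
print([text for _, text in results], "partitions scanned:", scanned)
```

The January row is never compared against the query vector, because its partition falls entirely outside the requested window.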

TimescaleVector provides multiple methods for querying vectors by doing similarity search with time-based filtering. 

Let's take a look at each method below:

In [None]:
# Time filter variables
start_dt = datetime(2023, 8, 1, 22, 10, 35)  # Start date = 1 August 2023, 22:10:35
end_dt = datetime(2023, 8, 30, 22, 10, 35)  # End date = 30 August 2023, 22:10:35
td = timedelta(days=7)  # Time delta = 7 days

query = "What's new with TimescaleDB functions?"
query_embedding = embedding_model.get_query_embedding(query)

Method 1: Filter within a provided start date and end date.

In [None]:
from llama_index.core.vector_stores import VectorStoreQuery

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5
)

query_result = ts_vector_store.query(
    vector_store_query, start_date=start_dt, end_date=end_dt
)

for node in query_result.nodes:
    print("-" * 80)
    print(node.metadata["date"])
    print(node.get_content(metadata_mode="all"))

Method 2: Filter within a provided start date, and a time delta later.

In [None]:
from llama_index.core.vector_stores import VectorStoreQuery

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5
)

query_result = ts_vector_store.query(
    vector_store_query, start_date=start_dt, timedelta=td
)

for node in query_result.nodes:
    print("-" * 80)
    print(node.metadata["date"])
    print(node.get_content(metadata_mode="all"))

Method 3: Filter within a provided end date and a time delta earlier.

In [None]:
from llama_index.core.vector_stores import VectorStoreQuery

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5
)

query_result = ts_vector_store.query(
    vector_store_query, end_date=end_dt, timedelta=td
)

for node in query_result.nodes:
    print("-" * 80)
    print(node.metadata["date"])
    print(node.get_content(metadata_mode="all"))
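
The three methods above are different ways of describing a single time window. The sketch below shows one way to think about how each combination resolves to a `(start, end)` pair. `resolve_time_range` is a hypothetical helper written for this explanation, not part of `TimescaleVectorStore`; its third parameter is named `time_delta` here only to avoid shadowing the `timedelta` class, and whether the window boundaries are inclusive in the actual store is not something this sketch asserts.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def resolve_time_range(
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
    time_delta: Optional[timedelta] = None,
) -> Tuple[datetime, datetime]:
    """Hypothetical helper: turn any of the three argument pairs into (start, end)."""
    if start_date is not None and end_date is not None:
        return start_date, end_date  # Method 1: explicit start and end
    if start_date is not None and time_delta is not None:
        return start_date, start_date + time_delta  # Method 2: start, then a delta later
    if end_date is not None and time_delta is not None:
        return end_date - time_delta, end_date  # Method 3: end, and a delta earlier
    raise ValueError("Provide two of: start_date, end_date, time_delta")


start_dt = datetime(2023, 8, 1, 22, 10, 35)
end_dt = datetime(2023, 8, 30, 22, 10, 35)
td = timedelta(days=7)

print(resolve_time_range(start_date=start_dt, end_date=end_dt))
print(resolve_time_range(start_date=start_dt, time_delta=td))
print(resolve_time_range(end_date=end_dt, time_delta=td))
```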

## Querying using other metadata

You can also filter based on other metadata. This shows a similarity search with a filter on author name.

In [None]:
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters, MetadataFilter, FilterOperator

filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="author", operator=FilterOperator.EQ, value="Sven Klemm"
        ),
    ]
)

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5, filters=filters,
)

query_result = ts_vector_store.query(
    vector_store_query
)

for node in query_result.nodes:
    print("-" * 80)
    print(node.metadata["date"])
    print(node.get_content(metadata_mode="all"))

You can also combine it with time-based search. This shows a similarity search with a time filter as well as a filter on author name.

In [None]:
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters, MetadataFilter, FilterOperator

filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="author", operator=FilterOperator.EQ, value="Sven Klemm"
        ),
    ]
)

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5, filters=filters,
)

query_result = ts_vector_store.query(
    vector_store_query, start_date=start_dt, end_date=end_dt
)

for node in query_result.nodes:
    print("-" * 80)
    print(node.metadata["date"])
    print(node.get_content(metadata_mode="all"))
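
When combining filters, it can be useful to sanity-check the returned metadata on the client side. The sketch below is a hypothetical helper that works on plain metadata dictionaries shaped like the ones built in `create_node` (the `date` value is the string form of a timezone-aware `datetime`). The sample records are made up for illustration. Note that it uses timezone-aware bounds, since Python raises a `TypeError` when comparing naive and aware datetimes.

```python
from datetime import datetime, timezone


def matches_filters(metadata: dict, author: str, start: datetime, end: datetime) -> bool:
    """Hypothetical check: does this metadata satisfy the author and time filters?"""
    commit_time = datetime.fromisoformat(metadata["date"])
    return metadata["author"] == author and start <= commit_time <= end


# Made-up sample metadata, shaped like the dictionaries created in create_node.
sample_results = [
    {"commit": "example-1", "author": "Sven Klemm", "date": "2023-08-20 14:24:04+03:00"},
    {"commit": "example-2", "author": "Sven Klemm", "date": "2023-09-05 09:00:00+02:00"},
    {"commit": "example-3", "author": "Someone Else", "date": "2023-08-15 10:00:00+00:00"},
]

# Timezone-aware bounds so they can be compared with the parsed commit times.
window_start = datetime(2023, 8, 1, 22, 10, 35, tzinfo=timezone.utc)
window_end = datetime(2023, 8, 30, 22, 10, 35, tzinfo=timezone.utc)

for metadata in sample_results:
    ok = matches_filters(metadata, "Sven Klemm", window_start, window_end)
    print(metadata["commit"], "passes both filters:", ok)
```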