# Timescale Vector (Postgres)

This notebook shows how to use the Postgres vector database (`TimescaleVector`). You'll learn how to use TimescaleVector for semantic search and time-based vector search, and how to create indexes to speed up queries.

## What is Timescale Vector?
**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**

Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.
- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via a DiskANN-inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from a rock-solid PostgreSQL foundation with enterprise-grade features like streaming backups and replication, high availability, and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.

## How to access Timescale Vector
Timescale Vector is available on [Timescale](https://www.timescale.com/products), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

- LangChain users get a 90-day free trial for Timescale Vector.
- To get started, [sign up](https://console.cloud.timescale.com/signup) for Timescale, create a new database, and follow this notebook!
- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in Python.

## Setup

In [None]:
# Pip install necessary packages
!pip install timescale-vector
!pip install openai
!pip install tiktoken

In this example, we'll use `OpenAIEmbeddings`, so let's load your OpenAI API key.

In [14]:
import os
# Run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...
# Get openAI api key by reading local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
OPENAI_API_KEY  = os.environ['OPENAI_API_KEY']

In [2]:
import os
import getpass

# Get the API key and save it as an environment variable
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [15]:
## Loading Environment Variables
from typing import List, Tuple
#from dotenv import load_dotenv
#load_dotenv()

Next we'll import the needed Python libraries and libraries from LangChain. Note that we import the `timescale-vector` library as well as the TimescaleVector vectorstore.

In [17]:
import timescale_vector
from datetime import datetime, timedelta
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.document_loaders.json_loader import JSONLoader
from langchain.docstore.document import Document
from langchain.vectorstores.timescalevector import TimescaleVector

## 1. Similarity Search with Euclidean Distance (Default)

We'll look at an example of doing a similarity search query on the State of the Union speech to find the most similar sentences to a given query sentence. We'll use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) as our similarity metric.
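
As a rough illustration of what ranking by Euclidean distance means, the sketch below scores a few toy vectors against a query vector and sorts them by increasing distance. The three-dimensional vectors and their labels are hypothetical stand-ins for real embeddings, and this is not how TimescaleVector computes distances internally.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, candidates):
    """Sort (label, vector) pairs by increasing distance to the query."""
    scored = [(label, euclidean_distance(query, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical 3-dimensional "embeddings"; real ones have many more dimensions.
toy_vectors = [
    ("courts", [0.9, 0.1, 0.0]),
    ("economy", [0.1, 0.9, 0.0]),
    ("health", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_distance(toy_query, toy_vectors)
for label, distance in ranking:
    print(f"{label}: {distance:.4f}")
```

A smaller score means a closer match, which is why the most relevant chunk appears first in the search results.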

In [18]:
# Load the text and split it into chunks
loader = TextLoader("../../../extras/modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet file you downloaded after creating a new database. The URI will look something like this: `postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`
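
If you want to sanity-check a service URI before using it, one option is to parse it with Python's standard library. The sketch below is only an illustration: the password, service id, and port in `example_url` are made-up placeholders, and `describe_service_url` is a hypothetical helper, not part of any Timescale library.

```python
from urllib.parse import parse_qs, urlsplit


def describe_service_url(url):
    """Split a Postgres service URI into its main components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "sslmode": parse_qs(parts.query).get("sslmode", [None])[0],
    }


# Placeholder values only; substitute the URI from your own cheatsheet file.
example_url = (
    "postgres://tsdbadmin:not-a-real-password"
    "@abc123.tsdb.cloud.timescale.com:5432/tsdb?sslmode=require"
)

info = describe_service_url(example_url)
# The password is deliberately left out of the summary.
print(info)
```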

In [19]:
# Timescale Vector needs the service url to your cloud database. You can see this as soon as you create the 
# service in the cloud UI or in your credentials.sql file
SERVICE_URL = os.environ['TIMESCALE_SERVICE_URL']

# Specify directly if testing
#SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"
#SERVICE_URL = "postgres://cevian@localhost:28815/timescaledb_vector"

# # You can also read it from an environment variable. We suggest using a .env file.
# import os

# SERVICE_URL = os.environ.get("TIMESCALE_SERVICE_URL", "")

In [20]:
# The TimescaleVector Module will try to create a table with the name of the collection.
# So, make sure that the collection name is unique and the user has the permission to create a table.
COLLECTION_NAME = "state_of_the_union_test"

# Create a Timescale Vector instance from the collection of documents
db = TimescaleVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
)

In [21]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query)

In [22]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.1845601444077416
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------

### Using a Timescale Vector as a Retriever
After initializing a TimescaleVector store, you can use it as a [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/).

In [23]:
# Use TimescaleVector as a retriever
retriever = db.as_retriever()

In [24]:
print(retriever)

tags=['TimescaleVector', 'OpenAIEmbeddings'] metadata=None vectorstore=<langchain.vectorstores.timescalevector.TimescaleVector object at 0x133010310> search_type='similarity' search_kwargs={}


Let's look at an example of using Timescale Vector as a retriever with the [RetrievalQA chain](https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa) and the [stuff chain](https://python.langchain.com/docs/modules/chains/document/stuff).

In this example, we'll ask the same query as above, but this time we'll pass the relevant documents returned from Timescale Vector to an LLM to use as context to answer our question.
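
Conceptually, a "stuff" chain places all retrieved documents into a single prompt alongside the question. The sketch below illustrates that idea only; the prompt wording and the `build_stuff_prompt` helper are assumptions made for illustration, not LangChain's actual template or API.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for real search results.
retrieved = [
    "Chunk one of the speech.",
    "Chunk two of the speech.",
]
prompt = build_stuff_prompt("What did the president say?", retrieved)
print(prompt)
```

Because every retrieved chunk is included, this approach suits cases where the combined chunks fit comfortably in the model's context window.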

First we'll create our stuff chain:

In [25]:
# Initialize GPT3.5 model
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature = 0.1, model = 'gpt-3.5-turbo-16k')

# Initialize a RetrievalQA class from a stuff chain
from langchain.chains import RetrievalQA
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever,
    verbose=True,
)

In [26]:
query = "What did the president say about Ketanji Brown Jackson?"
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [27]:
print(response)

The President said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, who is one of our nation's top legal minds and will continue Justice Breyer's legacy of excellence.


## 2. Similarity Search with time-based filtering

A key use case for Timescale Vector is efficient time-based vector search. Timescale Vector enables this by automatically partitioning vectors and associated metadata by time. This allows you to efficiently query vectors by both similarity to a query vector and time.

Time-based vector search functionality is helpful for applications like:
- Storing and retrieving LLM response history (e.g. chatbots)
- Finding the most recent embeddings that are similar to a query vector (e.g. recent news).
- Constraining similarity search to a relevant time range (e.g. asking time-based questions about a knowledge base).
- Anomaly detection, where you want to find anomalous vectors within a specified time range.

To illustrate how to use TimescaleVector's time-based vector search functionality, we'll ask questions about the git log history for TimescaleDB. We'll illustrate how to add documents with a time-based uuid and how to run similarity searches with time range filters.

### Extract content and metadata from git log JSON
First, let's load the git log data into a new collection in our PostgreSQL database named `timescale_commits`.

In [64]:
import json

We'll define a helper function to create a uuid for a document and associated vector embedding based on its timestamp. We'll use this function to create a uuid for each git log entry.

Important note: If you are working with documents and want the current date and time associated with each vector for time-based search, you can skip this step. By default, a uuid will be automatically generated when the documents are ingested.

In [65]:
from timescale_vector import client
# Function to take in a date string in the past and return a uuid v1
def create_uuid(date_string: str):
    if date_string is None:
        return None
    time_format = '%a %b %d %H:%M:%S %Y %z'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = client.uuid_from_time(datetime_obj)
    return str(uuid)
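
To build intuition for how a uuid can carry a timestamp, the sketch below constructs a version-1 UUID by hand using only the standard library and then reads the time back out. It is an illustration under stated assumptions, not the implementation of `client.uuid_from_time`: the helper names are hypothetical, and the clock sequence and node id are fixed at 0 as placeholders.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID v1 timestamps count 100-nanosecond ticks since this date (RFC 4122).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_from_datetime(dt, clock_seq=0, node=0):
    """Build a version-1 UUID whose time field encodes a timezone-aware dt."""
    delta = dt - GREGORIAN_EPOCH
    ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    fields = (
        ticks & 0xFFFFFFFF,       # time_low
        (ticks >> 32) & 0xFFFF,   # time_mid
        (ticks >> 48) & 0x0FFF,   # time_hi (version bits are set by uuid.UUID)
        (clock_seq >> 8) & 0x3F,  # clock_seq_hi (variant bits are set by uuid.UUID)
        clock_seq & 0xFF,         # clock_seq_low
        node,                     # 48-bit node id; 0 is a placeholder
    )
    return uuid.UUID(fields=fields, version=1)


def datetime_from_uuid1(value):
    """Recover the UTC datetime stored in a version-1 UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


# Same date format as the git log entries used in this notebook.
commit_time = datetime.strptime(
    "Tue Aug 29 18:13:24 2023 +0200", "%a %b %d %H:%M:%S %Y %z"
)
commit_uuid = uuid1_from_datetime(commit_time)
print(commit_uuid, datetime_from_uuid1(commit_uuid).isoformat())
```

Because the timestamp is recoverable from the id itself, a store can place each row in the right time partition without a separate date column.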

Next, we'll define a metadata function to extract the relevant metadata from the JSON record. We'll pass this function to the JSONLoader. See the [JSON document loader docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/json) for more details.

In [31]:
# Metadata extraction function to extract metadata from a JSON record
def extract_metadata(record: dict, metadata: dict) -> dict:
    metadata["id"] = create_uuid(record["date"])
    metadata["date"] = record["date"]
    metadata["author"] = record["author"]
    metadata["commit_hash"] = record["commit"]
    return metadata

Finally we can initialize the JSON loader to parse the JSON records. We also remove empty records for simplicity.

In [32]:
# Load data from JSON file and extract metadata
loader = JSONLoader(
    file_path='../../../extras/modules/ts_git_log.json',
    jq_schema='.commit_history[]',
    text_content=False,
    metadata_func=extract_metadata
)
documents = loader.load()

# Remove documents with None date
# This is required because we are using date as the primary key to partition the data by time
documents = [doc for doc in documents if doc.metadata["date"] is not None]

### Load documents and metadata into TimescaleVector vectorstore
Now that we have prepared our documents, let's process them and load them, along with their vector embedding representations, into our TimescaleVector vectorstore.

Since this is a demo, we will only load the first 1000 records. In practice, you can load as many records as you want.

In [33]:
# Keep only the first 1000 elements of documents
documents = documents[:1000]

Then we use the CharacterTextSplitter to split the documents into smaller chunks if needed for easier embedding. Note that this splitting process retains the metadata for each document.

In [34]:
# Split the documents into chunks for embedding
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
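
The sketch below illustrates what chunk size and chunk overlap mean, and how metadata can be carried over to every chunk. It is a deliberately simplified fixed-width sliding window, not the algorithm `CharacterTextSplitter` uses (which splits on a separator before merging pieces); the function names and the sample record are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def split_record(record, chunk_size, chunk_overlap):
    """Split one {'text', 'metadata'} record, copying metadata to each chunk."""
    return [
        {"text": chunk, "metadata": dict(record["metadata"])}
        for chunk in split_with_overlap(record["text"], chunk_size, chunk_overlap)
    ]


# Tiny sizes so the overlap is easy to see; the notebook uses 1000 and 200.
record = {"text": "abcdefghij", "metadata": {"author": "someone"}}
pieces = split_record(record, chunk_size=4, chunk_overlap=2)
for piece in pieces:
    print(piece)
```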

Next, we'll create a Timescale Vector instance from the collection of documents that we finished pre-processing.

First, we'll define a collection name, which will be the name of our table in the PostgreSQL database. 

We'll also define a time delta, which will be used as the interval for partitioning the data by time. Each partition will consist of data for the specified length of time. We'll use 7 days for simplicity, but you can pick whatever value makes sense for your use case. For example, if you query recent vectors frequently, you might want a smaller time delta such as 1 day; if you query vectors over a decade-long period, you might want a larger one such as 6 months or 1 year.

Finally, we'll create the TimescaleVector instance. We set the `ids` argument to the `id` field in our metadata, which holds the uuid we created in the pre-processing step above. We do this because we want the time part of our uuids to reflect past dates. If we wanted the current date and time to be associated with our documents, we could omit the `ids` argument, and uuids would be automatically created with the current date and time.

In [35]:
# Configure the time delta used to partition the table
timeDeltaArgDays = "days"
timeDeltaArgDaysValue = 7
kwargs = {timeDeltaArgDays: timeDeltaArgDaysValue}

# Define collection name
COLLECTION_NAME = "timescale_commits"

# Create a Timescale Vector instance from the collection of documents
db = TimescaleVector.from_documents(
      embedding=embeddings,
      ids = [doc.metadata["id"] for doc in docs],
      documents=docs,
      collection_name=COLLECTION_NAME,
      service_url=SERVICE_URL,
      time_partition_interval=timedelta(**kwargs),)
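
To get a feel for what the partition interval controls, the sketch below groups timestamps into fixed-width buckets and works out which buckets a time-range query would need to touch. This is a conceptual model only: the reference date, function names, and sample timestamps are assumptions, and the database's real partition boundaries may be chosen differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # arbitrary reference point for this sketch


def partition_index(ts, interval):
    """Return the number of whole intervals between EPOCH and ts."""
    return (ts - EPOCH) // interval


def build_partitions(timestamps, interval):
    """Group timestamps into buckets keyed by partition index."""
    buckets = defaultdict(list)
    for ts in timestamps:
        buckets[partition_index(ts, interval)].append(ts)
    return dict(buckets)


def partitions_to_scan(start, end, interval):
    """Indices of every partition that overlaps the range [start, end]."""
    first = partition_index(start, interval)
    last = partition_index(end, interval)
    return list(range(first, last + 1))


interval = timedelta(days=7)
commit_times = [
    datetime(2023, 1, 5),
    datetime(2023, 8, 3, 14, 30),
    datetime(2023, 8, 29, 18, 13),
]
partitions = build_partitions(commit_times, interval)
wanted = partitions_to_scan(datetime(2023, 8, 1), datetime(2023, 8, 8), interval)
print(f"{len(partitions)} partitions hold data; a one-week query touches {len(wanted)}")
```

A smaller interval means a narrow time filter touches fewer rows, while a larger interval keeps the total number of partitions down for wide-ranging queries.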

### Querying vectors by time and similarity

Now that we have loaded our documents into TimescaleVector, we can query them by time and similarity.

TimescaleVector provides three ways to run a similarity search with time-based filtering:

- Method 1: Filter within a provided start date and end date.
- Method 2: Filter within a provided start date and time delta later.
- Method 3: Filter within a provided end_date and time delta earlier.
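
The three methods above all describe a single time window. The sketch below shows one way the different argument combinations could be reduced to an explicit `(start, end)` pair. The `resolve_time_range` helper is hypothetical and written for illustration; it is not the library's internal logic.

```python
from datetime import datetime, timedelta


def resolve_time_range(start_date=None, end_date=None, time_delta=None):
    """Turn any two of start_date, end_date, time_delta into (start, end)."""
    if start_date is not None and end_date is not None:
        return start_date, end_date                # Method 1
    if start_date is not None and time_delta is not None:
        return start_date, start_date + time_delta  # Method 2
    if end_date is not None and time_delta is not None:
        return end_date - time_delta, end_date      # Method 3
    raise ValueError("provide two of start_date, end_date, time_delta")


start = datetime(2023, 8, 1, 22, 10, 35)
end = datetime(2023, 8, 30, 22, 10, 35)
week = timedelta(days=7)

print(resolve_time_range(start_date=start, end_date=end))
print(resolve_time_range(start_date=start, time_delta=week))
print(resolve_time_range(end_date=end, time_delta=week))
```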

Let's take a look at each method below:

In [41]:
# Time filter variables
# Start date = 1 August 2023, 22:10:35
start_dt = datetime(2023, 8, 1, 22, 10, 35)
# End date = 30 August 2023, 22:10:35
end_dt = datetime(2023, 8, 30, 22, 10, 35)
# Time delta = 7 days
td = timedelta(days=7)

query = "What's new with TimescaleDB functions?"


In [42]:
# Method 1: Query for vectors between start_date and end_date
docs_with_score = db.similarity_search_with_score(query, start_date=start_dt, end_date=end_dt)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.17487859725952148
Date:  Tue Aug 29 18:13:24 2023 +0200
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Aug 29 18:13:24 2023 +0200", "change summary": "Add compatibility layer for _timescaledb_internal functions", "change details": "With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.1750067576770027
Date:  Tue Aug 29 18:13:24 2023 +0200
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven

In [43]:
# Method 2: Query for vectors between start_dt and a time delta td later
# Most relevant vectors between 1 August and 7 days later
docs_with_score = db.similarity_search_with_score(query, start_date=start_dt, time_delta=td)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)


--------------------------------------------------------------------------------
Score:  0.1844592470575277
Date:  Thu Aug 3 14:30:23 2023 +0300
{"commit": " 7aeed663b9c0f337b530fd6cad47704a51a9b2ec", "author": "Dmitry Simonenko<dmitry@timescale.com>", "date": "Thu Aug 3 14:30:23 2023 +0300", "change summary": "Feature flags for TimescaleDB features", "change details": "This PR adds several GUCs which allow to enable/disable major timescaledb features:  - enable_hypertable_create - enable_hypertable_compression - enable_cagg_create - enable_policy_create "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.18464880081124158
Date:  Thu Aug 3 14:30:23 2023 +0300
{"commit": " 7aeed663b9c0f337b530fd6cad47704a51a9b2ec", "author": "Dmitry Simonenko<dmitry@timescale.com>", "date": "Thu Aug 3 14:30:23 2023 +0300", "change summary": "Feature flags for TimescaleDB features", 

In [44]:
# Method 3: Query for vectors between end_dt and a time delta td earlier
# Most relevant vectors between 30 August and 7 days earlier
docs_with_score = db.similarity_search_with_score(query, end_date=end_dt, time_delta=td)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Date: ", doc.metadata["date"])
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.17496256977542668
Date:  Tue Aug 29 18:13:24 2023 +0200
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sven@timescale.com>", "date": "Tue Aug 29 18:13:24 2023 +0200", "change summary": "Add compatibility layer for _timescaledb_internal functions", "change details": "With timescaledb 2.12 all the functions present in _timescaledb_internal were moved into the _timescaledb_functions schema to improve schema security. This patch adds a compatibility layer so external callers of these internal functions will not break and allow for more flexibility when migrating. "}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.17509043216704734
Date:  Tue Aug 29 18:13:24 2023 +0200
{"commit": " e4facda540286b0affba47ccc63959fefe2a7b26", "author": "Sven Klemm<sve

In each result above, only vectors within the specified time range are returned. These queries are very efficient as they only need to search the relevant partitions.

We can also use this functionality for question answering, where we want to find the most relevant vectors within a specified time range to use as context for answering a question. Let's take a look at an example below, using Timescale Vector as a retriever:

In [45]:
retriever = db.as_retriever()
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature = 0.1, model = 'gpt-3.5-turbo-16k')

from langchain.chains import RetrievalQA
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever,
    verbose=True,
)

query = "What's new with the timescaledb functions?"
response = qa_stuff.run(query)
print(response)



> Entering new RetrievalQA chain...

> Finished chain.
The timescaledb functions have undergone some changes. In one patch, the support functions for histogram, first, and last have been moved into the _timescaledb_functions schema. This change should be transparent for users who have objects using those aggregates. 

In another patch, the type support functions have also been moved into the _timescaledb_functions schema. 

Additionally, a compatibility layer has been added for the _timescaledb_internal functions. These functions were moved into the _timescaledb_functions schema in order to improve schema security. The compatibility layer ensures that external callers of these internal functions will not break and allows for more flexibility when migrating.


## 3. Using ANN Search Indexes to Speed Up Queries

You can speed up similarity queries by creating an index on the embedding column. You should only do this once you have ingested a large part of your data.

Timescale Vector supports the following indexes:
- timescale_vector_index: a DiskANN-inspired graph index for fast similarity search (default).
- pgvector's HNSW index: a hierarchical navigable small world graph index for fast similarity search.
- pgvector's IVFFLAT index: an inverted file index for fast similarity search. This index is not recommended for high-dimensional embeddings.

In [66]:
# Initialize an existing TimescaleVector store
COLLECTION_NAME = "timescale_commits"
embeddings = OpenAIEmbeddings()
db = TimescaleVector(
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    embedding_function=embeddings,
)

Using the `create_index()` function without additional arguments will create a timescale_vector_index by default, using the default parameters.

In [68]:
# create an index
# by default this will create a Timescale Vector (DiskANN) index
db.create_index()

You can also specify the parameters for the index. See the Timescale Vector documentation for a full discussion of the different parameters and their effects on performance.

In [87]:
# Create an index with explicit parameters; this fails if the index already exists
db.create_index(
    index_type=TimescaleVector.IndexType.TIMESCALE_VECTOR,
    index_name="tsv_index_2",
    **{
        TimescaleVector.IndexOptions.TSV_MAX_ALPHA.value: 1.0,
        TimescaleVector.IndexOptions.TSV_NUM_NEIGHBORS.value: 50,
    },
)


You can also specify the index type by passing the `index_type` argument to the `create_index()` function. Here we'll also use the `index_name` argument to provide a name for the index so that we can create multiple indexes on the same table and compare the performance if desired.

In [81]:
# Create an HNSW index  
db.create_index(index_type=TimescaleVector.IndexType.PGVECTOR_HNSW,
                index_name="pgvector_hnsw_index",
                **{TimescaleVector.IndexOptions.PGV_HNSW_M.value: 16, 
                   TimescaleVector.IndexOptions.PGV_HNSW_EF.value: 64}
                   )

In [86]:
# Create an IVFFLAT index
db.create_index(index_type=TimescaleVector.IndexType.PGVECTOR_IVFFLAT,
                    index_name="ivfflat_index",
                    **{TimescaleVector.IndexOptions.PGV_IVFLAT_NUM_LISTS.value:20},
                    **{TimescaleVector.IndexOptions.PGV_IVFLAT_NUM_RECORDS.value:1000})
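
If you create several indexes and want a rough comparison of query latency, a small timing helper can be enough to get started. The sketch below uses only the standard library; `time_call` is a hypothetical helper, and `fake_search` is a stand-in so the example runs without a database connection.

```python
import time


def time_call(func, *args, repeats=5, **kwargs):
    """Return (best_seconds, last_result) over `repeats` calls to func."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        started = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - started)
    return best, result


def fake_search(query, k=4):
    """Stand-in for a real similarity search; returns k dummy hits."""
    return [f"{query}-{i}" for i in range(k)]


best_seconds, hits = time_call(fake_search, "timescaledb", k=3)
print(f"best of 5 runs: {best_seconds:.6f}s, {len(hits)} hits")
```

With a live store, the same helper could wrap a call such as `db.similarity_search_with_score(query)`. Timings from a single machine are only indicative, so treat them as a starting point rather than a benchmark.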

## 4. Working with an existing TimescaleVector vectorstore

In the examples above, we created a vectorstore from a collection of documents. However, we often want to insert data into and query data from an existing vectorstore. Let's see how to initialize, add documents to, and query an existing collection of documents in a TimescaleVector vector store.

To work with an existing Timescale Vector store, we need to know the name of the table we want to query (`COLLECTION_NAME`) and the URL of the cloud PostgreSQL database (`SERVICE_URL`).

In [54]:
# Initialize an existing TimescaleVector store
COLLECTION_NAME = "state_of_the_union_test"
embeddings = OpenAIEmbeddings()
store = TimescaleVector(
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    embedding_function=embeddings,
)

To load new data into the table, we use the `add_documents()` function. This function takes a list of documents, each of which can carry its own metadata, and optionally a list of unique ids, one per document.

If you want your documents to be associated with the current date and time, you do not need to create a list of ids. A uuid will be automatically generated for each document.

If you want your documents to be associated with a past date and time, you can create a list of ids using the `uuid_from_time` function in the `timescale-vector` Python library, as shown in Section 2 above. This function takes a datetime object and returns a uuid with the date and time encoded in the uuid.

In [55]:
# Add documents to a collection in TimescaleVector
ids = store.add_documents([Document(page_content="foo")])
ids

['ae0cc1de-4dee-11ee-8c82-de1e4b2a0118']

In [58]:
# Query the vectorstore for similar documents
docs_with_score = store.similarity_search_with_score("foo")

In [59]:
docs_with_score[0]

(Document(page_content='foo', metadata={}), 5.006789860928507e-06)

In [60]:
docs_with_score[1]

(Document(page_content='foo', metadata={}), 5.006789860928507e-06)

### Deleting Data 

You can delete data by uuid or by a filter on the metadata.

In [61]:
ids = store.add_documents([Document(page_content="Bar")])

store.delete(ids)

True

Deleting by metadata is especially useful if you want to periodically update information scraped from a particular source, a particular date, or some other metadata attribute.

In [62]:
store.add_documents([Document(page_content="Hello World", metadata={"source": "www.example.com/hello"})])
store.add_documents([Document(page_content="Adios", metadata={"source": "www.example.com/adios"})])

store.delete_by_metadata({"source": "www.example.com/adios"})

store.add_documents([Document(page_content="Adios, but newer!", metadata={"source": "www.example.com/adios"})])

['64b491fa-4def-11ee-8c82-de1e4b2a0118']
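
The delete-then-re-add pattern above can be pictured with a small in-memory model. The sketch below is only a mental model under simple assumptions: records are plain dictionaries, and a record matches when every key in the filter has an equal value in its metadata. The helper names are hypothetical, and the database's real matching rules may differ.

```python
def matches(metadata, criteria):
    """True when every key/value pair in criteria appears in metadata."""
    return all(metadata.get(key) == value for key, value in criteria.items())


def delete_matching(records, criteria):
    """Return (kept_records, deleted_count) after removing matching records."""
    kept = [r for r in records if not matches(r["metadata"], criteria)]
    return kept, len(records) - len(kept)


records = [
    {"text": "Hello World", "metadata": {"source": "www.example.com/hello"}},
    {"text": "Adios", "metadata": {"source": "www.example.com/adios"}},
]

# Drop the stale entry for one source, then add its refreshed version.
records, deleted = delete_matching(records, {"source": "www.example.com/adios"})
records.append(
    {"text": "Adios, but newer!", "metadata": {"source": "www.example.com/adios"}}
)
print(deleted, [r["text"] for r in records])
```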

### Overwriting a vectorstore

If you have an existing collection, you can overwrite it by calling `from_documents` with `pre_delete_collection=True`.

In [None]:
db = TimescaleVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    pre_delete_collection=True,
)

In [34]:
docs_with_score = db.similarity_search_with_score("foo")

In [35]:
docs_with_score[0]

(Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../extras/modules/sta