# Timescale Vector

This notebook shows how to use the Postgres vector database (`TimescaleVector`).

## What is Timescale Vector?
**[Timescale Vector](https://www.timescale.com/ai) is PostgreSQL++ for AI applications.**

Timescale Vector enables you to efficiently store and query billions of vector embeddings in `PostgreSQL`.
- Enhances `pgvector` with faster and more accurate similarity search on 1B+ vectors via DiskANN inspired indexing algorithm.
- Enables fast time-based vector search via automatic time-based partitioning and indexing.
- Provides a familiar SQL interface for querying vector embeddings and relational data.

Timescale Vector scales with you from POC to production:
- Simplifies operations by enabling you to store relational metadata, vector embeddings, and time-series data in a single database.
- Benefits from rock-solid PostgreSQL foundation with enterprise-grade feature liked streaming backups and replication, high-availability and row-level security.
- Enables a worry-free experience with enterprise-grade security and compliance.

## How to use Timescale Vector
Timescale Vector is available on [Timescale](https://www.timescale.com/products), the cloud PostgreSQL platform. (There is no self-hosted version at this time.)

- LangChain users get a 90-day free trial for Timescale Vector.
- To get started, [signup](https://console.cloud.timescale.com/signup) to Timescale, create a new database and follow this notebook!
- See the [installation instructions](https://github.com/timescale/python-vector) for more details on using Timescale Vector in python.

## Setup

In [17]:
# Pip install necessary packages
!pip install timescale-vector
!pip install openai
!pip install tiktoken



In this example, we'll use `OpenAIEmbeddings`, so let's load your OpenAI API key.

In [18]:
import os
import getpass

# Get the API key and save it as an environment variable
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [19]:
## Loading Environment Variables
from typing import List, Tuple
from dotenv import load_dotenv

load_dotenv()

False

In [20]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.timescalevector import TimescaleVector
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document

## Similarity Search with Euclidean Distance (Default)

We'll look at an example of doing a similarity search query on the State of the Union speech to find the most similar sentences to a given query sentence. We'll use the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) as our similarity metric.

In [21]:
# Load the text and split it into chunks
loader = TextLoader("../../../extras/modules/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

To connect to your PostgreSQL database, you'll need your service URI, which can be found in the cheatsheet file you downloaded after creating a new database. The URI will look something like this: `postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require`

In [22]:
# Timescale Vector needs the service url to your cloud database. You can see this as soon as you create the 
# service in the cloud UI or in your credentials.sql file
SERVICE_URL = "postgres://tsdbadmin:<password>@<id>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require"
SERVICE_URL = "postgres://cevian@localhost:28815/timescaledb_vector"

# # You can get it from an enviornment variables. We suggest using a .env file.
# import os

# SERVICE_URL = os.environ.get("TIMESCALE_SERVICE_URL", "")

In [23]:
# The TimescaleVector Module will try to create a table with the name of the collection.
# So, make sure that the collection name is unique and the user has the permission to create a table.
COLLECTION_NAME = "state_of_the_union_test"

# Create a Timescale Vector instance from the collection of documents
db = TimescaleVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
)

In [24]:
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query)

In [25]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.18459315693659817
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
----------------------

## Working with Timescale Vector

In the example above, we created a vectorstore from a collection of documents. However, often we want to work insert data into and query data from an existing vectorstore. Let's see how to initialize, add documents to, and query an existing collection of documents in a TimescaleVector vector store.

In [26]:
# Initialize a TimescaleVector store
store = TimescaleVector(
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    embedding_function=embeddings,
)

In [27]:
# Add documents to a collection in TimescaleVector
ids = store.add_documents([Document(page_content="foo")])
ids

['30369e58-412b-11ee-8c06-6ee10b77fd07']

In [28]:
# Query the vectorstore for similar documents
docs_with_score = db.similarity_search_with_score("foo")

In [29]:
docs_with_score[0]

(Document(page_content='foo', metadata={}), 0.0)

In [30]:
docs_with_score[1]

(Document(page_content='foo', metadata={}), 0.0)

### Deleting Data 

You can delete data by uuid or by a filter on the metadata.

In [31]:
ids = store.add_documents([Document(page_content="Bar")])

store.delete(ids)

True

Deleting using metadata is especially useful if you want to periodically update information scraped from a particular source.

In [32]:
store.add_documents([Document(page_content="Hello World", metadata={"source": "www.example.com/hello"})])
store.add_documents([Document(page_content="Adios", metadata={"source": "www.example.com/adios"})])

store.delete_by_metadata({"source": "www.example.com/adios"})

store.add_documents([Document(page_content="Adios, but newer!", metadata={"source": "www.example.com/adios"})])

['32669ebc-412b-11ee-8c06-6ee10b77fd07']

### Overriding a vectorstore

If you have an existing collection, you override it by doing `from_documents` and setting `pre_delete_collection` = True

In [33]:
db = TimescaleVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    service_url=SERVICE_URL,
    pre_delete_collection=True,
)

In [34]:
docs_with_score = db.similarity_search_with_score("foo")

In [35]:
docs_with_score[0]

(Document(page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n\nWe’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  \n\nWe’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n\nWe’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.', metadata={'source': '../../../extras/modules/sta

### Using a Timescale Vector as a Retriever

After initializing a TimescaleVector store, you can use it as a [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/).

In [36]:
retriever = store.as_retriever()

In [37]:
print(retriever)

tags=['TimescaleVector', 'OpenAIEmbeddings'] metadata=None vectorstore=<langchain.vectorstores.timescalevector.TimescaleVector object at 0x12f5980d0> search_type='similarity' search_kwargs={}


## Advanced Usage



### Speeding up queries by creating an index

You can speed up similarity queries by creating an index on the embedding column. You should only do this once you have ingested a large part of your data.

In [38]:
store.create_ivfflat_index()