<a href="https://colab.research.google.com/github/antonum/Timescale-Workshops/blob/main/PGVector_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PGVector Basics

![Timescale logo](https://docs.timescale.com/static/logo-white-c6cf9b4c58cd066908a6335ece1957fd.svg)

pgvector is a Postgres extension that lets you store and manipulate vector data inside a Postgres database.
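Before the `vector` type can be used in a table, the extension has to be enabled in the database. A minimal sketch, assuming your database user is allowed to create extensions. `enable_vector_extensions` is a hypothetical helper name, and the `vectorscale` extension is only needed for the `diskann` index type:

```python
def enable_vector_extensions(cursor, with_vectorscale=True):
    """Enable pgvector and, optionally, pgvectorscale on the current database.

    `cursor` is any DB-API cursor, e.g. a psycopg2 cursor.
    Returns the list of SQL statements that were executed.
    """
    statements = ["CREATE EXTENSION IF NOT EXISTS vector;"]
    if with_vectorscale:
        # CASCADE also installs pgvector if it is not present yet
        statements.append("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
    for statement in statements:
        cursor.execute(statement)
    return statements
```

Call `enable_vector_extensions(cursor)` followed by `conn.commit()` once the connection has been opened. On a Timescale cloud service the extensions may already be enabled, in which case the statements change nothing.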



In [1]:
import psycopg2
from google.colab import userdata
#CONNECTION="postgres://tsdbadmin:<password>@xxxxxxx.yyyyy.tsdb.cloud.timescale.com:39966/tsdb?sslmode=require"
CONNECTION=userdata.get('TS_CONNECTION')
conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()

In [2]:
import pandas as pd
# helper function to convert SQL results to a pandas DataFrame
def sql_results_to_df(cursor):
  columns = [desc[0] for desc in cursor.description]
  data = cursor.fetchall()
  df = pd.DataFrame(data, columns=columns)
  return df

In [3]:
query = """
DROP TABLE IF EXISTS test_vector CASCADE;
"""
cursor.execute(query)
conn.commit()

In [5]:
query = """
CREATE TABLE IF NOT EXISTS test_vector (
  id bigserial PRIMARY KEY,
  text1 text,
  embedding1 vector(3)
  );
"""
cursor.execute(query)
conn.commit()

In [None]:
# generate test data
query = """
INSERT INTO test_vector (text1, embedding1) VALUES
('example text 1', '[1,2,3]'),
('example text 2', '[3,4,5]'),
('example text 3', '[10,11,12]'),
('example text 4', '[13,14,15]');
-- Add more rows as needed
"""
cursor.execute(query)
conn.commit()


In [7]:
query = """
SELECT *
FROM test_vector;
"""
cursor.execute(query)
sql_results_to_df(cursor)

|   | id | text1 | embedding1 |
|---|----|-------|------------|
| 0 | 1 | example text 1 | [1,2,3] |
| 1 | 2 | example text 2 | [3,4,5] |
| 2 | 3 | example text 3 | [10,11,12] |
| 3 | 4 | example text 4 | [13,14,15] |


In [11]:
# find the closest matches between vectors in the database and target vector [1,2,3]
query = """
SELECT
  id,
  text1,
  embedding1,
  embedding1 <=> '[1,2,3]' AS distance
FROM test_vector
ORDER BY distance
LIMIT 10;
"""
cursor.execute(query)
sql_results_to_df(cursor)

|   | id | text1 | embedding1 | distance |
|---|----|-------|------------|----------|
| 0 | 1 | example text 1 | [1,2,3] | 0.0 |
| 1 | 2 | example text 2 | [3,4,5] | 0.017292 |
| 2 | 3 | example text 3 | [10,11,12] | 0.048742 |
| 3 | 4 | example text 4 | [13,14,15] | 0.053744 |


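In the statement above, the `<=>` operator computes cosine distance. Besides cosine distance, you can also use L2 (Euclidean) distance (`<->`) and inner product (`<#>`) in your queries. Note that `<#>` returns the negative inner product, so that smaller values still mean closer matches.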

For most AI-related use cases, cosine distance is a good option to start with.
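To make the three operators concrete, here is a small pure-Python sketch that reproduces them outside the database. The function names are illustrative and not part of pgvector; the code only mirrors the definitions of the operators.

```python
import math

def cosine_distance(a, b):
    """Equivalent of pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Equivalent of pgvector's <-> operator: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Equivalent of pgvector's <#> operator, which returns the NEGATIVE inner product."""
    return -sum(x * y for x, y in zip(a, b))

target = [1, 2, 3]
for row in ([1, 2, 3], [3, 4, 5], [10, 11, 12], [13, 14, 15]):
    print(row,
          round(cosine_distance(row, target), 6),
          round(l2_distance(row, target), 6),
          negative_inner_product(row, target))
```

The cosine distances printed for the four example vectors should match the `distance` column returned by the query above, up to rounding.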

## Hybrid queries or queries with filters

One of the differentiators of Timescale, and Postgres in general, is that vector operations are not standalone queries but just another option in SQL. You can therefore combine a vector query with any SQL expression, from a simple WHERE clause to complex JOINs, UNIONs, etc.
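In application code the target vector and the filter value usually come from variables rather than literals. A minimal sketch of a parameterized hybrid query, assuming the `test_vector` table from this notebook and a psycopg2-style cursor; `to_vector_literal` and `nearest_with_prefix` are hypothetical helper names. Passing the `LIKE` pattern as a parameter also avoids having to escape `%` in the SQL string.

```python
def to_vector_literal(values):
    """Format a Python sequence as a pgvector text literal, e.g. '[1.0,2.0,3.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

def nearest_with_prefix(cursor, target, prefix, limit=10):
    """Return rows whose text1 starts with `prefix`, ordered by cosine distance to `target`."""
    query = """
    SELECT
      id,
      text1,
      embedding1,
      embedding1 <=> %s::vector AS distance
    FROM test_vector
    WHERE
      text1 LIKE %s
    ORDER BY distance
    LIMIT %s;
    """
    # The driver quotes each parameter; the ::vector cast turns the text into a vector
    cursor.execute(query, (to_vector_literal(target), prefix + "%", limit))
    return cursor.fetchall()
```

For example, `nearest_with_prefix(cursor, [1, 2, 3], "sample")` filters on `text1` and ranks by distance in a single statement, with all values supplied at call time.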

In [27]:
# generate test data
query = """
INSERT INTO test_vector (text1, embedding1) VALUES
('sample text 1', '[1,2,3]'),
('sample text 2', '[3,4,5]'),
('test text 1', '[10,11,12]'),
('test text 2', '[13,14,15]')
-- Add more rows as needed
"""
cursor.execute(query)
conn.commit()

In [14]:
# find the closest matches between vectors in the database and target vector [1,2,3], filter by text1 value
query = """
SELECT
  id,
  text1,
  embedding1,
  embedding1 <=> '[1,2,3]' AS distance
FROM test_vector
WHERE
  text1 LIKE 'sample%'
ORDER BY distance
LIMIT 10;
"""
cursor.execute(query)
sql_results_to_df(cursor)

|   | id | text1 | embedding1 | distance |
|---|----|-------|------------|----------|
| 0 | 5 | sample text 1 | [1,2,3] | 0.0 |
| 1 | 6 | sample text 2 | [3,4,5] | 0.017292 |


In [4]:
conn.commit()

## Improve performance by adding a vector index

You can improve the performance of your queries by adding a vector index. Timescale supports the `HNSW` and `IVFFlat` indexes of pgvector as well as the more advanced `StreamingDiskANN` vector index.

Let's create a `StreamingDiskANN` index for our table.

In [16]:
query = """
CREATE INDEX IF NOT EXISTS embedding1Idx ON test_vector USING diskann (embedding1 vector_cosine_ops);
"""
cursor.execute(query)
conn.commit()


Note that this index is specific to cosine distance (`vector_cosine_ops`). It speeds up queries that use the cosine distance operator `<=>`. Other distance operators, such as L2, still work but do not use this index.
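Because an index only serves the operator its operator class was built for, a query using a different distance needs its own index. A small sketch that maps each pgvector operator to its operator class and builds the matching `CREATE INDEX` statement. `index_ddl` and the generated index names are hypothetical, and the sketch deliberately targets pgvector's `hnsw` index type rather than `diskann`:

```python
# pgvector distance operator -> operator class used when creating an index
OPERATOR_CLASSES = {
    "<=>": "vector_cosine_ops",  # cosine distance
    "<->": "vector_l2_ops",      # L2 (Euclidean) distance
    "<#>": "vector_ip_ops",      # negative inner product
}

def index_ddl(table, column, operator, method="hnsw"):
    """Build a CREATE INDEX statement whose operator class matches `operator`."""
    opclass = OPERATOR_CLASSES[operator]  # raises KeyError for unknown operators
    # Derive a short suffix such as "l2" or "cosine" for the index name
    suffix = opclass.replace("vector_", "").replace("_ops", "")
    index_name = f"{table}_{column}_{suffix}_idx"
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING {method} ({column} {opclass});"
    )

print(index_ddl("test_vector", "embedding1", "<->"))
```

Running the printed statement through `cursor.execute(...)` and `conn.commit()` would add an L2 index next to the cosine one. Table and column names are interpolated directly into the SQL, so only pass trusted identifiers.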