
# Postgres Vector Store

This notebook shows how to create a LlamaIndex index in PostgreSQL (pgvector) rather than in memory, from data that has already been prepared for indexing (chunking, embedding generation, etc.) in `ailab-db`.

Tests on our Azure PostgreSQL instance show a disappointing `25 seconds` delay versus `<0.5 seconds` on a local PostgreSQL instance. It is worth investigating which configuration differences between the local instance and the Azure one could cause such a drastic jump.

We noticed that on both databases, no index is actually created on the embedding column of the vector store table (`data_llamaindex`). Looking closely at the LlamaIndex codebase, we found no obvious mention of its creation, so we created one manually using:

`CREATE INDEX ON data_llamaindex USING hnsw (embedding vector_cosine_ops);`

The delay is now `1.13 seconds` with the HNSW index versus `25 seconds` without.

This is a huge improvement. Keep in mind that our current Azure PostgreSQL instance is a development one, less powerful than the one intended for production.
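
As a rough sketch of how the index statement above could be parameterized: pgvector's HNSW index accepts the build options `m` and `ef_construction` (16 and 64 are the defaults documented by pgvector). The helper name `hnsw_index_sql` is hypothetical and only builds the SQL text; it does not connect to a database, and the table and column names are trusted as-is.

```python
def hnsw_index_sql(table, column="embedding", opclass="vector_cosine_ops",
                   m=16, ef_construction=64):
    """Build a pgvector HNSW CREATE INDEX statement (hypothetical helper)."""
    # Force the numeric options to integers so nothing else ends up in the SQL.
    m = int(m)
    ef_construction = int(ef_construction)
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


print(hnsw_index_sql("data_llamaindex"))
# CREATE INDEX ON data_llamaindex USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

Higher values generally cost more build time in exchange for better recall, so they are worth measuring on the production instance rather than assuming.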



In [None]:
# %pip install -r ../../requirements.txt

In [1]:
import logging
import sys
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings
import os
from dotenv import load_dotenv
from llama_index.storage.index_store.postgres import PostgresIndexStore
from llama_index.storage.docstore.postgres import PostgresDocumentStore
import psycopg
from psycopg.rows import dict_row
import json
import pickle
from sqlalchemy import make_url
from llama_index.core.schema import TextNode
from tqdm import tqdm
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.extractors import QuestionsAnsweredExtractor
import random
from pprint import pprint

load_dotenv()

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

True

In [2]:
def save_to_pickle(data, filename):
    with open(filename, "wb") as file:
        pickle.dump(data, file)

def load_from_pickle(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

### Setup LLM and Embed Model

In [3]:
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="ailab-llm",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="ada",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

Settings.llm = llm
Settings.embed_model = embed_model

### Creating nodes from louis_v005.documents

In [8]:
database = os.getenv("DB_NAME")
host = os.getenv("DB_HOST")
password = os.getenv("DB_PASSWORD")
port = os.getenv("DB_PORT")
user = os.getenv("DB_USER")

conn_string = (
    f"dbname={database} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)
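
Because `os.getenv` returns `None` for unset variables, the f-string above would silently produce something like `dbname=None`. A minimal sketch of a check that could run before building the connection string follows; the helper name `missing_env_vars` is an assumption for illustration, not part of the notebook.

```python
import os

REQUIRED_VARS = ("DB_NAME", "DB_HOST", "DB_PASSWORD", "DB_PORT", "DB_USER")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    # Report only; whether to raise here is a design choice for the notebook.
    print("Missing settings:", ", ".join(missing))
```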

In [9]:
query = """
    SELECT id, content, embedding, chunk_id, url, title, subtitle, tokens_count, last_updated, score
    FROM louis_v005.documents
"""
nodes = []
with psycopg.connect(conn_string) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        results = cur.execute(query).fetchall()
        for r in tqdm(results, desc="Processing records"):
            node = TextNode(
                text=r["content"],
                id_=str(r["id"]),
                embedding=json.loads(r["embedding"]),
            )
            node.metadata = {
                "id": str(r["id"]),
                "chunk_id": str(r["chunk_id"]),
                "url": r["url"],
                "title": r["title"],
                "subtitle": r["subtitle"],
                "tokens_count": r["tokens_count"],
                "last_updated": r["last_updated"],
                "score": r["score"],
            }
            nodes.append(node)

print(nodes[0])
save_to_pickle(nodes, "nodes.pkl")
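
The cell above passes `r["embedding"]` through `json.loads`, which works when the driver returns the vector column as text such as `[0.1,0.2]` (an assumption: no pgvector type adapter is registered on the connection). A small sketch that also checks the dimension against the `embed_dim=1536` used for the vector store could look like this; `parse_embedding` is a hypothetical helper.

```python
import json


def parse_embedding(raw, expected_dim=1536):
    """Turn a pgvector text value like '[0.1,0.2]' into a list of floats."""
    # Accept either the text form or an already-decoded sequence.
    values = json.loads(raw) if isinstance(raw, str) else list(raw)
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return [float(v) for v in values]


print(parse_embedding("[0.5,1,2]", expected_dim=3))  # [0.5, 1.0, 2.0]
```

Failing early here is cheaper than discovering a dimension mismatch when the rows are inserted into the vector store table.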

### Create the Database

In [14]:
connection_string = conn_string
# connection_string = "postgresql://postgres:testpwd@localhost:5432"
new_database = "llamaindexdb"

with psycopg.connect(connection_string) as conn:
    # CREATE/DROP DATABASE cannot run inside a transaction block, hence autocommit.
    conn.autocommit = True
    with conn.cursor() as cur:
        # Destructive: removes any existing database with this name.
        cur.execute(f"DROP DATABASE IF EXISTS {new_database}")
        cur.execute(f"CREATE DATABASE {new_database}")
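
The database name is interpolated directly into the SQL text, which is fine for a hard-coded value but risky if the name ever comes from configuration. psycopg's `sql.Identifier` is the usual tool for quoting identifiers; the sketch below is a dependency-free alternative that simply rejects anything other than a plain lowercase identifier. The helper name `safe_identifier` is hypothetical.

```python
import re

# Conservative pattern: lowercase letters, digits and underscores only.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def safe_identifier(name):
    """Return name unchanged if it is a plain identifier, else raise."""
    # PostgreSQL truncates identifiers longer than 63 bytes.
    if len(name) > 63 or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


print(safe_identifier("llamaindexdb"))  # llamaindexdb
```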

### Create the indexes

In [15]:
vector_store = PGVectorStore.from_params(
    database=new_database,
    host=host,
    password=password,
    port=port,
    user=user,
    embed_dim=1536,
)

document_store = PostgresDocumentStore.from_params(    
    database=new_database,
    host=host,
    password=password,
    port=port,
    user=user,
)

index_store = PostgresIndexStore.from_params(
    database=new_database,
    host=host,
    password=password,
    port=port,
    user=user,
)

storage_context = StorageContext.from_defaults(
    docstore=document_store,
    index_store=index_store, 
    vector_store=vector_store, 
)

storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(nodes, storage_context=storage_context)

retriever = index.as_retriever(similarity_top_k=5)



In [None]:
conn_string_new_db = (
    f"dbname={new_database} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)

with psycopg.connect(conn_string_new_db) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("""
        CREATE INDEX ON public.data_llamaindex USING hnsw (embedding vector_cosine_ops);
        """)

### Testing

#### Generating a question from a random url

In [31]:
query = """
    SELECT c.url
    FROM louis_v005.crawl as c
    """
with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        results = cur.execute(query).fetchall()
        urls = [r[0] for r in results]

urls = [url for url in urls if "/fra/" not in url]
pprint(urls[0:5])
save_to_pickle(urls, "urls.pkl")

['https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149',
 'https://inspection.canada.ca/eng/1664715510668/1664715511012',
 'https://inspection.canada.ca/plant-health/potatoes/potato-varieties/norland/eng/1312587385821/1312587385822',
 'https://inspection.canada.ca/eng/1653077788730/1653077789089',
 'https://inspection.canada.ca/plant-health/potatoes/references/eng/1326492425237/1326492502093']


In [5]:
urls = load_from_pickle("urls.pkl")
# random_url = random.choice(urls)
random_url = urls[0]
documents = SimpleWebPageReader(html_to_text=True).load_data([random_url])
assert len(documents) == 1
extractor = QuestionsAnsweredExtractor(questions=1)
questions = await extractor.aextract(documents)


100%|██████████| 1/1 [00:08<00:00,  8.05s/it]


In [6]:
print("url", random_url)
question = questions[0]["questions_this_excerpt_can_answer"].removeprefix("Question: ")
print(question)

url https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149
What are the steps and considerations involved in the sampling procedures for food safety according to the Canadian Food Inspection Agency?


#### Checking if querying the index returns the right url

In [9]:
vector_store = PGVectorStore.from_params(
    database=new_database,
    host=host,
    password=password,
    port=port,
    user=user,
    embed_dim=1536,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=5)

In [10]:
# import time
# start_time = time.time()
nodes = retriever.retrieve(question)

# end_time = time.time()
# elapsed_time = end_time - start_time
# print(f"Elapsed time: {elapsed_time:.2f} seconds")

start.............
get_agg_embedding_from_queries: 0.20 seconds
_build_vector_store_query: 0.00 seconds
_vector_store.query: 1.20 seconds
_retrieve: 1.39 seconds
_handle_recursive_retrieval: 0.00 seconds
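
To collect comparable wall-clock numbers without editing library code, a small context manager around the call is one option. This is a minimal sketch: `timed` is a hypothetical helper, and the `time.sleep` call stands in for `retriever.retrieve(question)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the elapsed time of the block and optionally store it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")


timings = {}
with timed("retrieve", timings):
    time.sleep(0.01)  # stand-in for retriever.retrieve(question)
```

Keeping the results in a dictionary makes it easy to compare the local and Azure instances side by side across several runs.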


In [11]:
print(nodes[0].dict())

{'node': {'id_': '4916a845-5358-42e8-93ed-be8b7beddf55', 'embedding': None, 'metadata': {'id': '4916a845-5358-42e8-93ed-be8b7beddf55', 'chunk_id': 'def3245a-e853-46f4-84e4-6b0f32a04983', 'url': 'https://inspection.canada.ca/inspection-and-enforcement/guidance-for-food-inspection-activities/sample-collection/as-required-food-sample-collection/eng/1653062252765/1653062253358', 'title': 'Operational procedure: As required food sample collection - Canadian Food Inspection Agency', 'subtitle': 'On this page;1.0 Purpose;2.0 Authorities', 'tokens_count': 267, 'last_updated': '2022-06-20', 'score': 0.5025328114227888}, 'excluded_embed_metadata_keys': [], 'excluded_llm_metadata_keys': [], 'relationships': {}, 'text': 'On this page 1.0 Purpose 2.0 Authorities 3.0 Reference documents 4.0 Definitions 5.0 Acronyms 6.0 Operational procedure 6.1 Prepare for the inspection 6.2 Conduct the inspection 6.3 Communicate the inspection results 6.4 Conduct the follow-up inspection 7.0 Appendix Annex B: DSDP 

In [41]:
found = False
for i, n in enumerate(nodes):
    if n.metadata["url"] == random_url:
        found = True
        print(f"Position: {i+1}", n.metadata["title"])

if not found:
    print("Right:", random_url)
    for n in nodes:
        print("Wrong: ", n.metadata["url"])

Right: https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149
Wrong:  https://inspection.canada.ca/inspection-and-enforcement/guidance-for-food-inspection-activities/sample-collection/as-required-food-sample-collection/eng/1653062252765/1653062253358
Wrong:  https://inspection.canada.ca/food-safety-for-industry/food-safety-rules-for-small-business/eng/1643050798737/1643050800221
Wrong:  https://inspection.canada.ca/food-safety-for-industry/information-for-media/eng/1528746083978/1528746084227
Wrong:  https://inspection.canada.ca/importing-food-plants-or-animals/food-imports/step-by-step-guide/eng/1523979839705/1523979840095
Wrong:  https://inspection.canada.ca/inspection-and-enforcement/guidance-for-food-inspection-activities/sample-collection/eng/1589914459022/1589914459318
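
A single question is a small sample, so the miss above says little on its own. As a sketch of how the same check could be summarized over many generated questions, the helpers below compute the hit rate and the mean reciprocal rank from the position of the expected URL in each result list. The function names and the toy data are assumptions for illustration, not results from our index.

```python
def rank_of(expected_url, retrieved_urls):
    """1-based position of expected_url in retrieved_urls, or None if absent."""
    for position, url in enumerate(retrieved_urls, start=1):
        if url == expected_url:
            return position
    return None


def summarize(ranks):
    """Hit rate and mean reciprocal rank over a list of ranks (None = miss)."""
    total = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        "hit_rate": len(hits) / total if total else 0.0,
        "mrr": sum(1.0 / r for r in hits) / total if total else 0.0,
    }


# Toy data: one hit at position 2, one miss.
ranks = [rank_of("a", ["b", "a"]), rank_of("a", ["b", "c"])]
print(summarize(ranks))  # {'hit_rate': 0.5, 'mrr': 0.25}
```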
