

# Postgres Vector Store

This notebook shows how to create a LlamaIndex index in PostgreSQL (PGVector) rather than in memory, from data that has already been prepared for indexing (chunking, embedding generation, ...) in `ailab-db`.

Tests on our Azure PostgreSQL instance show a disappointing `25 seconds` delay, versus `<0.5 seconds` on a local PostgreSQL instance. It is worth investigating which configuration differences between the local and Azure instances could cause such a drastic jump.

We noticed that, on both databases, no index is actually created on the embedding column of the vector store table (`data_llamaindex`). Looking closely at the LlamaIndex codebase, we found no obvious mention of its creation, so we created one manually using:

`CREATE INDEX ON data_llamaindex USING hnsw (embedding vector_cosine_ops);`

The delay is now `1.13 seconds` with the HNSW index, versus `25 seconds` without it.

This is a huge improvement. Keep in mind that our current Azure PostgreSQL instance is a development one, less powerful than the one meant for production.
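
To make the before/after comparison above easier to reproduce, here is a minimal timing sketch. It is not the method used to obtain the numbers quoted above; `time_call` and `summarize` are hypothetical helper names introduced for illustration. It calls a function several times and reports the spread, which helps separate a cold first query from steady-state latency.

```python
import statistics
import time


def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns the last result and the list of elapsed times in seconds.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return result, timings


def summarize(timings):
    """Reduce a list of timings to min / median / max (seconds)."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Once `retriever` and `question` are defined further down, something like `nodes, timings = time_call(retriever.retrieve, question)` followed by `print(summarize(timings))` would give comparable figures with and without the index.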


In [None]:
# %pip install -r ../../requirements.txt

In [1]:
import logging
import sys
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings
import os
from dotenv import load_dotenv
from llama_index.storage.index_store.postgres import PostgresIndexStore
from llama_index.storage.docstore.postgres import PostgresDocumentStore
import psycopg
from psycopg.sql import SQL, Identifier
from psycopg.rows import dict_row
import json
import pickle
from sqlalchemy import make_url
from llama_index.core.schema import TextNode
from tqdm import tqdm
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.extractors import QuestionsAnsweredExtractor
import random
from pprint import pprint

load_dotenv()

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

True

In [2]:
def save_to_pickle(data, filename):
    with open(filename, "wb") as file:
        pickle.dump(data, file)


def load_from_pickle(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

### Set up the LLM and embedding model


In [3]:
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="ailab-llm",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="ada",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

Settings.llm = llm
Settings.embed_model = embed_model

### Creating nodes from louis_v005.documents


In [8]:
louis_db = os.getenv("DB_NAME")
host = os.getenv("DB_HOST")
password = os.getenv("DB_PASSWORD")
port = os.getenv("DB_PORT")
user = os.getenv("DB_USER")
llamaindex_db = "llamaindex_db_legacy"
admin_db = "postgres"
llamaindex_schema = "v_0_0_1"

In [5]:
conn_string = (
    f"dbname={louis_db} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)

query = """
    SELECT id, content, embedding, chunk_id, url, title, subtitle, tokens_count, last_updated, score
    FROM louis_v005.documents
"""
nodes = []
with psycopg.connect(conn_string) as conn:
    with conn.cursor(row_factory=dict_row) as cur:
        results = cur.execute(query).fetchall()
        for r in tqdm(results, desc="Processing records"):
            node = TextNode(
                text=r["content"],
                id_=str(r["chunk_id"]),
                embedding=json.loads(r["embedding"]),
            )
            node.metadata = {
                "id": str(r["id"]),
                "chunk_id": str(r["chunk_id"]),
                "url": r["url"],
                "title": r["title"],
                "subtitle": r["subtitle"],
                "tokens_count": r["tokens_count"],
                "last_updated": r["last_updated"],
                "score": r["score"],
            }
            nodes.append(node)

print(nodes[0])
save_to_pickle(nodes, "nodes.pkl")

Processing records: 100%|██████████| 103836/103836 [01:03<00:00, 1644.90it/s]


Node ID: a8fa477f-5a9e-493a-b50a-e435a15b1bc5
Text: 6.18 Enzymes Reserved for future use 6.19 Gut modifier
ingredients 6.19.1 Prebiotics 6.19.2 Viable microorganisms 6.19.3
Acidifiers 6.19.1 Prebiotics Reserved for future use 6.19.2 Viable
microorganisms Reserved for future use 6.19.3 Acidifiers Reserved for
future use 6.20 Forage additives 1-601-019 Propionic acid Is an
organic acid, generally e...
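
The cell above relies on `json.loads` to turn the `embedding` column into a list, which works because pgvector's text representation looks like `[0.1,0.2,...]`. A minimal sketch of a stricter parser follows, assuming the column arrives as a string or as an existing sequence of numbers. `parse_embedding` is a hypothetical helper, and the default of 1536 dimensions mirrors the `embed_dim` passed to `PGVectorStore` later in this notebook.

```python
import json

EXPECTED_DIM = 1536  # assumption: same value as embed_dim used for PGVectorStore


def parse_embedding(raw, expected_dim=EXPECTED_DIM):
    """Parse a pgvector text value such as '[0.1,0.2]' into a list of floats.

    Also accepts an already-decoded sequence of numbers. Raises ValueError
    when the number of dimensions does not match expected_dim.
    """
    values = json.loads(raw) if isinstance(raw, str) else list(raw)
    if len(values) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(values)}"
        )
    return [float(v) for v in values]
```

Failing early on a dimension mismatch is cheaper than discovering it when 100k+ rows are being inserted into the vector store.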


### Create the Database


In [6]:
connection_string = (
    f"dbname={admin_db} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)
# connection_string = "postgresql://postgres:testpwd@localhost:5432"

# with psycopg.connect(connection_string) as conn:
#     conn.autocommit = True
#     with conn.cursor() as cur:
#         cur.execute(SQL("DROP DATABASE IF EXISTS {}").format(Identifier(llamaindex_db)))
#         cur.execute(SQL("CREATE DATABASE {}").format(Identifier(llamaindex_db)))

try:
    with psycopg.connect(connection_string) as conn:
        conn.autocommit = True
        with conn.cursor() as cur:
            # Quote the database name as an identifier instead of interpolating it
            cur.execute(SQL("CREATE DATABASE {}").format(Identifier(llamaindex_db)))
            print(f"Database {llamaindex_db} created.")
except psycopg.errors.DuplicateDatabase:
    print(f"Database {llamaindex_db} already exists.")

Database llamaindex_db_legacy already exists.


### Create the tables


In [9]:
vector_store = PGVectorStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    embed_dim=1536,
    schema_name=llamaindex_schema,
)

document_store = PostgresDocumentStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    schema_name=llamaindex_schema,
)

index_store = PostgresIndexStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    schema_name=llamaindex_schema,
)

storage_context = StorageContext.from_defaults(
    docstore=document_store,
    index_store=index_store,
    vector_store=vector_store,
)

storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(nodes, storage_context=storage_context)

retriever = index.as_retriever(similarity_top_k=5)

### Create the index

In [10]:
connection_string = (
    f"dbname={llamaindex_db} "
    f"user={user} "
    f"password={password} "
    f"host={host} "
    f"port={port}"
)

schema = Identifier(llamaindex_schema)
query = SQL(
    "CREATE INDEX ON {}.data_llamaindex USING hnsw (embedding vector_cosine_ops)"
).format(schema)

with psycopg.connect(connection_string) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(query)
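
Because the `CREATE INDEX` statement above has no name or `IF NOT EXISTS` clause, re-running the cell would build a second index. A minimal sketch of a guard is shown below, assuming a psycopg cursor with the default tuple rows (not `dict_row`). `list_hnsw_indexes` is a hypothetical helper; it reads the standard `pg_indexes` view and keeps the entries whose definition mentions `USING hnsw`.

```python
def list_hnsw_indexes(cur, schema, table):
    """Return the names of HNSW indexes defined on schema.table.

    `cur` is expected to behave like a psycopg cursor yielding tuple rows.
    """
    cur.execute(
        "SELECT indexname, indexdef FROM pg_indexes "
        "WHERE schemaname = %s AND tablename = %s",
        (schema, table),
    )
    return [
        name
        for name, indexdef in cur.fetchall()
        if "using hnsw" in indexdef.lower()
    ]
```

With this in place, the index creation could be skipped when `list_hnsw_indexes(cur, llamaindex_schema, "data_llamaindex")` returns a non-empty list.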

### Testing

#### Generating a question from a random url


In [31]:
query = """
    SELECT c.url
    FROM louis_v005.crawl as c
    """
with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        results = cur.execute(query).fetchall()
        urls = [r[0] for r in results]

urls = [url for url in urls if "/fra/" not in url]
pprint(urls[0:5])
save_to_pickle(urls, "urls.pkl")

['https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149',
 'https://inspection.canada.ca/eng/1664715510668/1664715511012',
 'https://inspection.canada.ca/plant-health/potatoes/potato-varieties/norland/eng/1312587385821/1312587385822',
 'https://inspection.canada.ca/eng/1653077788730/1653077789089',
 'https://inspection.canada.ca/plant-health/potatoes/references/eng/1326492425237/1326492502093']


In [11]:
urls = load_from_pickle("urls.pkl")
# random_url = random.choice(urls)
random_url = urls[0]
documents = SimpleWebPageReader(html_to_text=True).load_data([random_url])
assert len(documents) == 1
extractor = QuestionsAnsweredExtractor(questions=1)
questions = await extractor.aextract(documents)

100%|██████████| 1/1 [00:02<00:00,  2.64s/it]


In [12]:
print("url", random_url)
question = questions[0]["questions_this_excerpt_can_answer"].removeprefix("Question: ")
print(question)

url https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149
What are the steps and considerations for collecting environmental samples for microbial testing in a food production setting, according to the Canadian Food Inspection Agency?


#### Checking if querying the index returns the right url


In [17]:
vector_store = PGVectorStore.from_params(
    database=llamaindex_db,
    host=host,
    password=password,
    port=port,
    user=user,
    embed_dim=1536,
    schema_name=llamaindex_schema
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=5)

In [18]:
# import time
# start_time = time.time()
nodes = retriever.retrieve(question)

# end_time = time.time()
# elapsed_time = end_time - start_time
# print(f"Elapsed time: {elapsed_time:.2f} seconds")

In [19]:
print(nodes[0].dict())

{'node': {'id_': 'e456dca5-3079-4702-b89c-d469f951526f', 'embedding': None, 'metadata': {'id': '379a866f-3802-485e-afe3-85bb4c08e238', 'chunk_id': 'e456dca5-3079-4702-b89c-d469f951526f', 'url': 'https://inspection.canada.ca/inspection-and-enforcement/guidance-for-food-inspection-activities/sample-collection/food-sample-collection/eng/1540234969218/1540235089869', 'title': 'Operational guideline: Food sample collection - Canadian Food Inspection Agency', 'subtitle': 'On this page;1.0 Purpose;2.0 Authorities', 'tokens_count': 482, 'last_updated': '2023-03-24', 'score': 0.5859392646494657}, 'excluded_embed_metadata_keys': [], 'excluded_llm_metadata_keys': [], 'relationships': {}, 'text': 'On this page 1.0 Purpose 2.0 Authorities 3.0 Reference documents 4.0 Definitions 5.0 Acronyms 6.0 Operational guideline 6.1 Prepare for the inspection 6.2 Conduct the inspection 6.3 Communicate the inspection results 6.4 Conduct the follow-up inspection 7.0 Appendix Appendix 1: Aseptic sample collection 

In [20]:
found = False
for i, n in enumerate(nodes):
    if n.metadata["url"] == random_url:
        found = True
        print(f"Position: {i+1}", n.metadata["title"])

if not found:
    print("Right:", random_url)
    for n in nodes:
        print("Wrong: ", n.metadata["url"])

Position: 3 Sampling procedures - Canadian Food Inspection Agency
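
The check above prints the position for a single question. To compare configurations over many questions, the same logic can be turned into a score. The following is a minimal sketch, not part of the original evaluation: `first_hit_position` and `reciprocal_rank` are hypothetical helper names, and they operate on plain lists of URLs rather than on retrieved nodes.

```python
def first_hit_position(retrieved_urls, expected_url):
    """Return the 1-based position of expected_url, or None if it is absent."""
    for position, url in enumerate(retrieved_urls, start=1):
        if url == expected_url:
            return position
    return None


def reciprocal_rank(retrieved_urls, expected_url):
    """Return 1/position of the first hit, or 0.0 when there is no hit."""
    position = first_hit_position(retrieved_urls, expected_url)
    return 0.0 if position is None else 1.0 / position
```

For the run above, `reciprocal_rank([n.metadata["url"] for n in nodes], random_url)` would be `1/3`, since the expected page was found at position 3. Averaging this value over a sample of URLs gives a mean reciprocal rank.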
