<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/postgres.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Postgres Vector Store

This notebook shows how to build a LlamaIndex index in PostgreSQL (PGVector) rather than in memory, using data crawled directly from `inspection.canada.ca`.

In a few informal tests on a set of 500 pages, the results were accurate. Formal tests on the whole document set are still needed to confirm this.

Queries are also faster: under 500 ms with PGVector versus more than 3000 ms in memory.

As can be seen from the indexing process, new individual documents can be added to the index on the go. Similar methods are available to remove documents. This opens the door to CRUD capabilities. The recommendation is to use this method to build a new index from scratch.

**Notes:**
- These tests were conducted on a local machine, so they do not account for round-trip delays to a remote database.




In [None]:
%pip install -r ../../requirements.txt

In [9]:
import logging
import sys
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core import Settings
import os
from dotenv import load_dotenv
from llama_index.storage.index_store.postgres import PostgresIndexStore
from llama_index.storage.docstore.postgres import PostgresDocumentStore
from llama_index.core.node_parser import SentenceSplitter
from pprint import pprint
import psycopg
from llama_index.core.schema import Document
import pickle
from sqlalchemy import make_url
from datetime import datetime
from llama_index.readers.web import SimpleWebPageReader
from bs4 import BeautifulSoup
from llama_index.core.extractors import QuestionsAnsweredExtractor

load_dotenv()

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

True

In [4]:
def save_to_pickle(data, filename):
    with open(filename, "wb") as file:
        pickle.dump(data, file)

def load_from_pickle(filename):
    with open(filename, "rb") as file:
        return pickle.load(file)

### Set up the LLM and embedding model

In [5]:
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="ailab-llm",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="ada",
    api_key=os.getenv("API_KEY"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_version=os.getenv("API_VERSION"),
)

Settings.llm = llm
Settings.embed_model = embed_model

### Creating a sample document collection

#### Get a list of all urls

In [72]:
conn_string = (
    f"dbname={os.getenv('DB_NAME')} "
    f"user={os.getenv('DB_USER')} "
    f"password={os.getenv('DB_PASSWORD')} "
    f"host={os.getenv('DB_HOST')} "
    f"port={os.getenv('DB_PORT')}"
)
query = """
    SELECT c.url
    FROM louis_v005.crawl as c
    """
with psycopg.connect(conn_string) as conn:
    with conn.cursor() as cur:
        results = cur.execute(query).fetchall()
        urls = [r[0] for r in results]

pprint(urls[0:5])
save_to_pickle(urls, "urls.pkl")


['https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149',
 'https://inspection.canada.ca/eng/1664715510668/1664715511012',
 'https://inspection.canada.ca/plant-health/potatoes/potato-varieties/norland/eng/1312587385821/1312587385822',
 'https://inspection.canada.ca/controles-preventifs/lutte-antiparasitaire/fra/1511206644150/1528205213795',
 'https://inspection.canada.ca/eng/1653077788730/1653077789089']
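One pitfall with the connection string built above: `os.getenv` returns `None` for an unset variable, and the f-string then silently produces something like `dbname=None`. The sketch below fails early instead. The helper name `build_conn_string` is hypothetical, and it assumes the values contain no spaces or quotes (libpq would otherwise require quoting).

```python
import os

# Maps each libpq keyword to the environment variable used in this notebook.
CONN_KEYS = {
    "dbname": "DB_NAME",
    "user": "DB_USER",
    "password": "DB_PASSWORD",
    "host": "DB_HOST",
    "port": "DB_PORT",
}


def build_conn_string(env=os.environ):
    """Build a libpq-style connection string, failing if a variable is unset."""
    missing = [var for var in CONN_KEYS.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return " ".join(f"{key}={env[var]}" for key, var in CONN_KEYS.items())
```

Calling `build_conn_string()` after `load_dotenv()` should give the same string as the cell above when every variable is set.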


#### Create a sample of nodes

In [7]:
from typing import Optional

import requests


class AiLabWebPageReader(SimpleWebPageReader):
    """AiLab web page reader.

    Reads pages from the web.

    Args:
        html_to_text (bool): Whether to convert HTML to text.
            Requires `html2text` package.
        metadata_fn (Optional[Callable[[str, str], Dict]]): A function that takes
            a URL and the page's raw HTML and returns a dictionary of metadata.
            Default is None.
    """

    @classmethod
    def class_name(cls) -> str:
        return "AiLabWebPageReader"
    
    def load_data(self, urls: list[str]) -> list[Document]:
        """Load documents from the given URLs.

        Args:
            urls (List[str]): List of URLs to scrape.

        Returns:
            List[Document]: List of documents.

        """
        if not isinstance(urls, list):
            raise ValueError("urls must be a list of strings.")
        documents = []
        for url in urls:
            response = requests.get(url, headers=None).text

            metadata: Optional[dict] = None
            if self._metadata_fn is not None:
                metadata = self._metadata_fn(url, response)
            
            if self.html_to_text:
                import html2text
                response = html2text.html2text(response)

            documents.append(Document(text=response, id_=url, metadata=metadata or {}))

        return documents


In [None]:
urls = load_from_pickle("urls.pkl")
nb_pages = 500

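def metadata_fn(url: str, html: str) -> dict:
    """Build per-page metadata from the URL and the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Some pages may have no <title> (or an empty one); fall back to an empty string.
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    return {
        "url": url,
        "title": title,
        "last_crawled": datetime.now().strftime("%Y-%m-%d"),
        # French pages contain "/fra/" in their path.
        "lang": "fr" if "/fra/" in url else "en",
    }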

documents = AiLabWebPageReader(html_to_text=True, metadata_fn=metadata_fn).load_data(
    urls[0:nb_pages]
)
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

In [27]:
pprint(len(documents))
pprint(len(nodes))

500
2783


### Create the Database

In [29]:
# connection_string=conn_string
connection_string = "postgresql://postgres:testpwd@localhost:5432"
db_name = "llamaindex_db_crawl"

with psycopg.connect(connection_string) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(psycopg.sql.SQL("DROP DATABASE IF EXISTS {}").format(psycopg.sql.Identifier(db_name)))
        cur.execute(psycopg.sql.SQL("CREATE DATABASE {}").format(psycopg.sql.Identifier(db_name)))

### Create the indexes

In [None]:
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    embed_dim=1536,
)

document_store = PostgresDocumentStore.from_params(    
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
)

index_store = PostgresIndexStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
)

storage_context = StorageContext.from_defaults(
    docstore=document_store,
    index_store=index_store, 
    vector_store=vector_store, 
)

storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(nodes, storage_context=storage_context)
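Because the index is persisted in Postgres, single pages can be added or removed after the initial build. The sketch below wraps the `insert` and `delete_ref_doc` methods that LlamaIndex indexes expose. The helper names `add_page`, `remove_page` and `replace_page` are hypothetical. It assumes `index` is the `VectorStoreIndex` created above and that each document's `id_` is its URL, as set by `AiLabWebPageReader`.

```python
def add_page(index, document):
    """Insert one document into an existing index (chunked and embedded on insert)."""
    index.insert(document)


def remove_page(index, url, delete_from_docstore=True):
    """Remove every node derived from the document whose id is `url`."""
    index.delete_ref_doc(url, delete_from_docstore=delete_from_docstore)


def replace_page(index, document):
    """Delete, then insert, so a re-crawled page replaces its old nodes."""
    remove_page(index, document.id_)
    add_page(index, document)
```

For example, a re-crawled page loaded with `AiLabWebPageReader(...).load_data([url])[0]` could be passed to `replace_page(index, ...)`. This has not been tested against the full document set.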



### Query the index

In [None]:
pprint(urls[0:nb_pages])

In [None]:
retriever = index.as_retriever(similarity_top_k=5)
# French query, roughly "tissues prohibited in the food chain"
nodes = retriever.retrieve("tissus interdits dans la chaine alimentaire")

### Testing

#### Generating a question from a random URL

In [10]:
# urls = load_from_pickle("urls.pkl")
# eng_urls = [url for url in urls if "/fra/" not in url]
# random_url = random.choice(eng_urls)
random_url = urls[0]
documents = SimpleWebPageReader(html_to_text=True).load_data([random_url])
assert len(documents) == 1
extractor = QuestionsAnsweredExtractor(questions=1)
questions = await extractor.aextract(documents)


100%|██████████| 1/1 [00:04<00:00,  4.09s/it]
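This section pins `random_url` to `urls[0]`, with the random choice left commented out (and `random` never imported). A minimal sketch of a repeatable random pick follows. The helper name `pick_english_url` and its `seed` parameter are hypothetical, and it reuses the notebook's heuristic that French pages contain `/fra/` in their URL.

```python
import random


def pick_english_url(urls, seed=None):
    """Return one non-French URL; a fixed seed makes the choice repeatable."""
    # Same language heuristic as the metadata function: French pages contain "/fra/".
    eng_urls = [u for u in urls if "/fra/" not in u]
    if not eng_urls:
        raise ValueError("no English URLs to choose from")
    # A private Random instance avoids touching the global random state.
    return random.Random(seed).choice(eng_urls)
```

With this, `random_url = pick_english_url(urls, seed=42)` would select the same page on every run, which makes a failed retrieval easier to reproduce.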


In [11]:
print("url", random_url)
question = questions[0]["questions_this_excerpt_can_answer"].removeprefix("Question: ")
print(question)

url https://inspection.canada.ca/preventive-controls/sampling-procedures/eng/1518033335104/1528203403149
What are the steps and considerations for collecting environmental samples for microbial testing in a food production setting according to the Canadian Food Inspection Agency?


In [15]:
# connection_string=conn_string
connection_string = "postgresql://postgres:testpwd@localhost:5432"
db_name = "llamaindex_db_crawl"
url = make_url(connection_string)
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    embed_dim=1536,
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=15)

In [16]:
import time

# perf_counter is a monotonic clock, better suited to timing than time.time()
start_time = time.perf_counter()
nodes = retriever.retrieve(question)
elapsed_time = time.perf_counter() - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 0.36 seconds
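A single retrieval is a noisy measurement, since the first call may include connection setup. The sketch below repeats the call and reports the minimum, median and maximum latency. The helper name `time_call` and its `repeats` parameter are hypothetical.

```python
import statistics
import time


def time_call(fn, repeats=10):
    """Call `fn` `repeats` times and summarize wall-clock latency in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Usage would look like `time_call(lambda: retriever.retrieve(question))`. Note that every call also requests an embedding for the query, so the figures include that network round trip.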


In [17]:
found = False
for i, n in enumerate(nodes):
    if n.metadata["url"] == random_url:
        found = True
        print(f"Position: {i+1}", n.metadata["title"])

if not found:
    print("Right:", random_url)
    for n in nodes:
        print("Wrong: ", n.metadata["url"])

Position: 1 Sampling procedures - Canadian Food Inspection Agency
Position: 2 Sampling procedures - Canadian Food Inspection Agency
Position: 3 Sampling procedures - Canadian Food Inspection Agency
Position: 4 Sampling procedures - Canadian Food Inspection Agency
Position: 5 Sampling procedures - Canadian Food Inspection Agency
Position: 7 Sampling procedures - Canadian Food Inspection Agency
Position: 9 Sampling procedures - Canadian Food Inspection Agency
Position: 12 Sampling procedures - Canadian Food Inspection Agency
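To move from this informal check toward the formal tests mentioned in the introduction, the positions printed above can be reduced to a few numbers per question. This is a sketch only: the helper name `rank_metrics` and the choice of metrics (hit, first position, reciprocal rank, hit count) are assumptions, not part of the notebook.

```python
def rank_metrics(retrieved_urls, expected_url):
    """Summarize where `expected_url` appears in a ranked list of retrieved URLs."""
    # 1-based positions of every hit, matching the "Position: N" output.
    positions = [
        i for i, url in enumerate(retrieved_urls, start=1) if url == expected_url
    ]
    first = positions[0] if positions else None
    return {
        "hit": first is not None,
        "first_position": first,
        "reciprocal_rank": 1.0 / first if first else 0.0,
        "hits_in_top_k": len(positions),
    }
```

It would be called as `rank_metrics([n.metadata["url"] for n in nodes], random_url)`. For the run shown above, that corresponds to a first position of 1 and 8 hits among the 15 retrieved nodes.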
