## Creating an index and populating it with documents using PostgreSQL+pgvector

A simple example of how to ingest PDF documents, then web page content, into a PostgreSQL+pgvector VectorStore.

Requirements:
- A PostgreSQL cluster with the pgvector extension installed (https://github.com/pgvector/pgvector)
- A database created in the cluster with the extension enabled (in this example, the database is named `vectordb`). To enable the extension, run the following command in the database as a superuser:
`CREATE EXTENSION vector;`

Note: if your PostgreSQL is deployed on OpenShift, you can run the command directly from inside the Pod (using the Terminal view in the Console, or `oc rsh` to log into the Pod): `psql -d vectordb -c "CREATE EXTENSION vector;"`


### Needed packages

In [6]:
!pip install -q pgvector


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Base parameters, the PostgreSQL info

In [7]:
product_version = "2.6"  # kept as a string so that a version such as "2.10" is not shortened to 2.1
CONNECTION_STRING = "postgresql+psycopg://vectordb:vectordb@postgresql.postgresql.svc.cluster.local:5432/vectordb"
COLLECTION_NAME = f"rhoai-doc-{product_version}"
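
The connection string above embeds the credentials directly. As a minimal sketch, the same URL could be assembled from environment variables instead, with the user name and password percent-encoded so that special characters do not break the URL. The `PGVECTOR_*` variable names and the `build_connection_string` helper are assumptions made for this illustration; they are not part of the notebook or of any library.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database,
                            driver="postgresql+psycopg"):
    """Assemble an SQLAlchemy-style URL, percent-encoding the credentials."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; the defaults reproduce the
# values used in this notebook.
CONNECTION_STRING = build_connection_string(
    user=os.environ.get("PGVECTOR_USER", "vectordb"),
    password=os.environ.get("PGVECTOR_PASSWORD", "vectordb"),
    host=os.environ.get("PGVECTOR_HOST", "postgresql.postgresql.svc.cluster.local"),
    port=os.environ.get("PGVECTOR_PORT", "5432"),
    database=os.environ.get("PGVECTOR_DATABASE", "vectordb"),
)
```

With none of the variables set, this yields the same string as the hard-coded value.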

#### Imports

In [8]:
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores.pgvector import PGVector

## Initial index creation and document ingestion

#### Download and load PDFs

In [9]:
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

pdfs = [f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/pdf/{doc}/red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"red_hat_openshift_ai_self-managed-{product_version}-{doc}-en-us": f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}/html-single/{doc}/index" for doc in documents}
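
The two comprehensions above rely on a naming convention: the key of `pdfs_to_urls` is the PDF file name without its `.pdf` extension. The sketch below spells that convention out with three small helpers (`pdf_stem`, `pdf_url`, and `html_url` are hypothetical names introduced here) and checks that the stem of a generated PDF URL matches the dictionary key. It also shows why a float version number is fragile.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

BASE_URL = (
    "https://access.redhat.com/documentation/en-us/"
    "red_hat_openshift_ai_self-managed"
)


def pdf_stem(doc, version):
    """File name of the PDF without the .pdf extension."""
    return f"red_hat_openshift_ai_self-managed-{version}-{doc}-en-us"


def pdf_url(doc, version):
    """Download URL of the PDF edition of a guide."""
    return f"{BASE_URL}/{version}/pdf/{doc}/{pdf_stem(doc, version)}.pdf"


def html_url(doc, version):
    """URL of the single-page HTML edition of the same guide."""
    return f"{BASE_URL}/{version}/html-single/{doc}/index"


url = pdf_url("release_notes", "2.6")
# The stem of the downloaded file is exactly the key used in pdfs_to_urls.
stem_from_url = PurePosixPath(urlparse(url).path).stem
print(stem_from_url == pdf_stem("release_notes", "2.6"))

# A float loses trailing zeros when formatted, so 2.10 would point at 2.1.
print(f"{2.10}")
```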

In [22]:
import requests
import os
    
rhel_pdf_urls = [
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/performing_a_standard_rhel_9_installation/red_hat_enterprise_linux-9-performing_a_standard_rhel_9_installation-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/performing_an_advanced_rhel_9_installation/red_hat_enterprise_linux-9-performing_an_advanced_rhel_9_installation-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/configuring_basic_system_settings/red_hat_enterprise_linux-9-configuring_basic_system_settings-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/security_hardening/red_hat_enterprise_linux-9-security_hardening-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/composing_a_customized_rhel_system_image/red_hat_enterprise_linux-9-composing_a_customized_rhel_system_image-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/configuring_and_managing_networking/red_hat_enterprise_linux-9-configuring_and_managing_networking-en-us.pdf',
        'https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/pdf/upgrading_from_rhel_8_to_rhel_9/red_hat_enterprise_linux-9-upgrading_from_rhel_8_to_rhel_9-en-us.pdf',
]
    

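# Create the target folder; exist_ok avoids a FileExistsError when the cell is re-run
os.makedirs("rhel-doc", exist_ok=True)

for pdf in rhel_pdf_urls:
    try:
        # The timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(pdf, timeout=60)
    except requests.RequestException:
        print(f"Skipped {pdf}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}")
        continue
    # Save the PDF under its original file name
    with open(f"rhel-doc/{pdf.split('/')[-1]}", "wb") as f:
        f.write(response.content)
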
In [10]:
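import requests
import os

# Create the target folder; exist_ok avoids a FileExistsError when the cell is re-run
os.makedirs(f"rhoai-doc-{product_version}", exist_ok=True)

for pdf in pdfs:
    try:
        # The timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(pdf, timeout=60)
    except requests.RequestException:
        print(f"Skipped {pdf}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}")
        continue
    # Save the PDF under its original file name
    with open(f"rhoai-doc-{product_version}/{pdf.split('/')[-1]}", "wb") as f:
        f.write(response.content)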
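
The two download cells share the same logic, so it could be factored into one helper. The following is a sketch only, using the standard library's `urllib` in place of `requests`; `download_all` and `fetch_bytes` are hypothetical names introduced for this illustration. The helper creates the folder if needed, skips files that are already on disk, and reports the URLs it could not fetch.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=60):
    """Return the body of `url`; urlopen raises an OSError subclass on failure."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, target_dir, fetch=fetch_bytes):
    """Download each URL into target_dir and return (saved_paths, skipped_urls)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    saved, skipped = [], []
    for url in urls:
        destination = target / Path(urlparse(url).path).name
        if destination.exists():
            # Already downloaded in a previous run: keep the existing file.
            saved.append(destination)
            continue
        try:
            data = fetch(url)
        except OSError as err:  # covers URLError, HTTPError and timeouts
            print(f"Skipped {url}: {err}")
            skipped.append(url)
            continue
        destination.write_bytes(data)
        saved.append(destination)
    return saved, skipped
```

Under these assumptions, `download_all(pdfs, f"rhoai-doc-{product_version}")` and `download_all(rhel_pdf_urls, "rhel-doc")` would stand in for the two loops. The `fetch` parameter exists so that the function can be exercised without network access.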

In [11]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

In [23]:
# Note: this cell reassigns pdf_docs, replacing the documents loaded from the previous folder
pdf_folder_path = "rhel-doc/"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

#### Inject metadata

In [25]:
from pathlib import Path

for doc in pdf_docs:
    doc.metadata["source"] = Path(doc.metadata["source"]).stem
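
The cell above stores the bare file stem as the source. The `pdfs_to_urls` dictionary built earlier is keyed by that same stem, so one possible use of it is to replace the stem with the matching HTML URL where one exists. The sketch below illustrates the idea with a stand-in `Doc` class rather than LangChain's `Document`; `Doc` and `attach_source_urls` are hypothetical names, and falling back to the stem for unknown files is a design choice made here, not something the notebook prescribes.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Minimal stand-in exposing the two attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def attach_source_urls(docs, stem_to_url):
    """Set metadata['source'] to the mapped URL, or to the file stem if unmapped."""
    for doc in docs:
        stem = Path(doc.metadata["source"]).stem
        doc.metadata["source"] = stem_to_url.get(stem, stem)
    return docs


stem_to_url = {
    "red_hat_openshift_ai_self-managed-2.6-release_notes-en-us":
        "https://access.redhat.com/documentation/en-us/"
        "red_hat_openshift_ai_self-managed/2.6/html-single/release_notes/index",
}
sample_docs = [
    Doc("...", {"source": "rhoai-doc-2.6/red_hat_openshift_ai_self-managed-2.6-release_notes-en-us.pdf", "page": 0}),
    Doc("...", {"source": "rhel-doc/red_hat_enterprise_linux-9-security_hardening-en-us.pdf", "page": 0}),
]
attach_source_urls(sample_docs, stem_to_url)
for doc in sample_docs:
    print(doc.metadata["source"])
```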

#### Load websites

In [26]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [27]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [28]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [29]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Red Hat Enterprise Linux\n \n9\nSecurity hardening\nEnhancing security of Red Hat Enterprise Linux 9 systems\nLast Updated: 2024-04-03', metadata={'source': 'red_hat_enterprise_linux-9-security_hardening-en-us', 'page': 0})
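
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and lines; `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is fully covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the effect the 40-character overlap has at a larger scale: a sentence cut at a chunk boundary still appears partly in both neighbours.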

#### Clean up documents, as PostgreSQL won't accept the NUL character, '\x00', in TEXT fields

In [30]:
for doc in all_splits:
    doc.page_content = doc.page_content.replace('\x00', '')

#### Create the index and ingest the documents

In [31]:
embeddings = HuggingFaceEmbeddings()

db = PGVector.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    #pre_delete_collection=True # This deletes existing collection and its data, use carefully!
)

#### Alternatively, add new documents

In [None]:
# embeddings = HuggingFaceEmbeddings()

# db = PGVector(
#     connection_string=CONNECTION_STRING,
#     collection_name=COLLECTION_NAME,
#     embedding_function=embeddings)

# db.add_documents(all_splits)
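
Ingesting the same splits more than once is likely to store them again, which would show up as identical results with identical scores in a query. As a partial safeguard, duplicates inside a single batch can be dropped before ingestion. The sketch below is an assumption-laden illustration: `split_key` and `drop_duplicates` are hypothetical helpers, they only compare splits within the list they are given, and they do not check what is already stored in the collection.

```python
import hashlib
from types import SimpleNamespace


def split_key(page_content, source):
    """Stable fingerprint of a split, built from its source and its text."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(page_content.encode("utf-8"))
    return digest.hexdigest()


def drop_duplicates(splits):
    """Return the splits in their original order, keeping the first of each duplicate."""
    seen = set()
    unique = []
    for split in splits:
        key = split_key(split.page_content, str(split.metadata.get("source", "")))
        if key in seen:
            continue
        seen.add(key)
        unique.append(split)
    return unique


# SimpleNamespace stands in for a document with page_content and metadata.
sample = [
    SimpleNamespace(page_content="Install the Operator", metadata={"source": "guide"}),
    SimpleNamespace(page_content="Install the Operator", metadata={"source": "guide"}),
    SimpleNamespace(page_content="Install the Operator", metadata={"source": "other"}),
]
print(len(drop_duplicates(sample)))  # 2
```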

#### Test query

In [34]:
query = "How do you install OpenShift Data Science?"
#query = "How can I upgrade from rhel 8 to rhel 9?"
docs_with_score = db.similarity_search_with_score(query)
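
How to read the score depends on the distance strategy the store is configured with. Assuming cosine distance is in use (this should be checked against the installed LangChain version), the score is a distance rather than a similarity, so lower values mean closer matches. The pure-Python sketch below shows how such a value is computed for two small vectors; `cosine_distance` is an illustrative helper, not the code path used by the database.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```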

In [35]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.29386721289949214
Installing the Red Hat OpenShift AI Operator
.
4
. 
Install OpenShift AI components. See 
Installing and managing Red Hat OpenShift AI
components
.
5
. 
Configure user and administrator groups to provide user access to OpenShift AI. See 
Adding
users
.
6
. 
Access the OpenShift AI dashboard. See 
Accessing the OpenShift AI dashboard
.
7
. 
Optionally, enable graphics processing units (GPUs) in OpenShift AI to ensure that your data
scientists can use compute-heavy workloads in their models. See 
Enabling GPU support in
OpenShift AI
.
Red Hat OpenShift AI Self-Managed 2.6 Installing and uninstalling OpenShift AI Self-Managed
6
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.29386721289949214
Installing the Red Hat OpenShift AI Operator
.
4
. 
Install OpenShif