## Creating an index and populating it with documents using Redis

A simple example of how to ingest PDF documents, then web page content, into a Redis vector store.

Requirements:
- A Redis cluster
- A Redis database with at least 2 GB of memory (to match the initial index cap)

### Install Packages

In [None]:
!pip install -qU langchain-huggingface psycopg "psycopg[binary]" langchain-community sentence-transformers redis redisearch redis-py-cluster pypdf

### Base parameters: Redis connection info

In [None]:
password = "password"  # Change this to the password of the Redis instance
ragendpoint = "redb-rag-instance.redis-rag.svc.cluster.local"  # Double-check the endpoint of the Redis instance
port = "14283"  # Change this to the port of the Redis instance
redis_url = f"redis://default:{password}@{ragendpoint}:{port}"
index_name = "docs"
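
The URL above is assembled by plain string interpolation, which can break when the password contains characters such as `@`, `:` or `/`. A minimal sketch of a safer construction follows. The helper name `build_redis_url` and the sample values are assumptions made for illustration and are not part of the original notebook. It percent-encodes the credentials with the standard library before building the URL.

```python
from urllib.parse import quote


def build_redis_url(password: str, host: str, port: str, user: str = "default") -> str:
    """Assemble a redis:// URL, percent-encoding the credentials.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # safe="" makes quote() also encode "/", which it would otherwise leave as-is.
    return f"redis://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Placeholder values only; substitute the real password, endpoint and port.
example_url = build_redis_url("p@ss:word/1", "redis.example.internal", "14283")
print(example_url)
```

With a password made only of letters and digits, the result is identical to the plain concatenation used in the cell above.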

#### Imports

In [None]:
# Loaders and the Redis vector store live in langchain-community;
# HuggingFaceEmbeddings comes from the langchain-huggingface package installed above.
from langchain_community.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain_community.vectorstores.redis import Redis
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Initial index creation and document ingestion

#### Download and load PDFs

In [None]:
documents = [
    "release_notes",
    "introduction_to_red_hat_openshift_ai",
    "getting_started_with_red_hat_openshift_ai_self-managed",
    "openshift_ai_tutorial_-_fraud_detection_example",
    "developing_a_model",
    "integrating_data_from_amazon_s3",
    "working_on_data_science_projects",
    "serving_models",
    "monitoring_data_science_models",
    "managing_users",
    "managing_resources",
    "installing_and_uninstalling_openshift_ai_self-managed",
    "installing_and_uninstalling_openshift_ai_self-managed_in_a_disconnected_environment",
    "upgrading_openshift_ai_self-managed",
    "upgrading_openshift_ai_self-managed_in_a_disconnected_environment",   
]

# Assumption: product_version is not defined anywhere else in this notebook.
# "2-latest" is a placeholder; set it to the documentation version to ingest.
product_version = "2-latest"

doc_root = f"https://access.redhat.com/documentation/en-us/red_hat_openshift_ai_self-managed/{product_version}"
file_prefix = f"red_hat_openshift_ai_self-managed-{product_version}"

# PDF download URLs, and a mapping from each PDF file name (without extension) to its HTML page
pdfs = [f"{doc_root}/pdf/{doc}/{file_prefix}-{doc}-en-us.pdf" for doc in documents]
pdfs_to_urls = {f"{file_prefix}-{doc}-en-us": f"{doc_root}/html-single/{doc}/index" for doc in documents}

In [None]:
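import os

import requests

docs_dir = f"rhoai-doc-{product_version}"

# Create the target folder if it does not exist yet
os.makedirs(docs_dir, exist_ok=True)

for pdf in pdfs:
    try:
        response = requests.get(pdf, timeout=60)
    except requests.RequestException as exc:
        # A network error on one file should not abort the whole loop
        print(f"Skipped {pdf}: {exc}")
        continue
    if response.status_code != 200:
        print(f"Skipped {pdf}: HTTP {response.status_code}")
        continue
    # Save the PDF under its original file name
    with open(os.path.join(docs_dir, pdf.split("/")[-1]), "wb") as f:
        f.write(response.content)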

#### Document loading from a folder containing PDFs

In [None]:
pdf_folder_path = f"./rhoai-doc-{product_version}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()
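
The `pdfs_to_urls` mapping is built earlier but never used in this notebook. One plausible use, shown here as an assumption rather than the author's confirmed intent, is to replace the local file path stored in each page's `source` metadata with the URL of the matching HTML page, so that search results can link to the online documentation. The sketch below uses a stand-in `FakeDoc` class and made-up file names so that it runs without LangChain. The helper name `relink_sources` is also hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def relink_sources(docs, stem_to_url):
    """Replace local PDF paths in metadata['source'] with their HTML URLs."""
    for doc in docs:
        # Path(...).stem is the file name without its ".pdf" extension
        stem = Path(doc.metadata.get("source", "")).stem
        # Leave the source untouched when no mapping exists for this file
        if stem in stem_to_url:
            doc.metadata["source"] = stem_to_url[stem]
    return docs


# Made-up name and URL, shaped like the pdfs_to_urls entries above
mapping = {"example-1.0-serving_models-en-us": "https://docs.example.com/serving_models/index"}
sample = [
    FakeDoc("page text", {"source": "docs/example-1.0-serving_models-en-us.pdf", "page": 0}),
    FakeDoc("other text", {"source": "docs/unknown.pdf", "page": 0}),
]
relink_sources(sample, mapping)
print([d.metadata["source"] for d in sample])
```

Applied to the notebook's own data, the equivalent call would be `relink_sources(pdf_docs, pdfs_to_urls)`, assuming the loader stores the file path under the `source` key.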

#### Load websites

In [None]:
websites = [
    "https://ai-on-openshift.io/getting-started/openshift/",
    "https://ai-on-openshift.io/getting-started/opendatahub/",
    "https://ai-on-openshift.io/getting-started/openshift-ai/",
    "https://ai-on-openshift.io/odh-rhoai/configuration/",
    "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
    "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
    "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
    "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
    "https://ai-on-openshift.io/tools-and-applications/minio/minio/",
    "https://access.redhat.com/articles/7047935",
    "https://access.redhat.com/articles/rhoai-supported-configs",
]

In [None]:
website_loader = WebBaseLoader(websites)
website_docs = website_loader.load()

#### Merge both types of docs

In [None]:
docs = pdf_docs + website_docs

#### Split documents into chunks with some overlap

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
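
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph breaks and whitespace. It only illustrates how consecutive chunks share their boundary characters. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# The last 4 characters of each chunk reappear at the start of the next one
```

With the notebook's values (1024 and 40), each chunk would repeat roughly the last 40 characters of its predecessor, which helps keep sentences that straddle a boundary retrievable from either chunk.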

#### Create the index and ingest the documents

In [None]:
# Specify the desired model name
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Example model

# Initialize HuggingFaceEmbeddings with the specified model
embeddings = HuggingFaceEmbeddings(model_name=model_name)

# Proceed with creating the Redis vector store
rds = Redis.from_documents(
    all_splits,
    embeddings,
    redis_url=redis_url,
    index_name=index_name
)

#### Write the schema to a YAML file to be able to open the index later on

In [7]:
rds.write_schema("redis_schema.yaml")

## Ingesting new documents

#### Example with Web pages

In [8]:
from langchain.document_loaders import WebBaseLoader

In [10]:
# Same page paths as in the initial ingestion above (openshift-ai, odh-rhoai)
loader = WebBaseLoader(["https://ai-on-openshift.io/getting-started/openshift/",
                        "https://ai-on-openshift.io/getting-started/opendatahub/",
                        "https://ai-on-openshift.io/getting-started/openshift-ai/",
                        "https://ai-on-openshift.io/odh-rhoai/configuration/",
                        "https://ai-on-openshift.io/odh-rhoai/custom-notebooks/",
                        "https://ai-on-openshift.io/odh-rhoai/nvidia-gpus/",
                        "https://ai-on-openshift.io/odh-rhoai/custom-runtime-triton/",
                        "https://ai-on-openshift.io/odh-rhoai/openshift-group-management/",
                        "https://ai-on-openshift.io/tools-and-applications/minio/minio/"
                       ])

In [11]:
data = loader.load()

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(data)

In [14]:
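# Use the same embedding model as at index creation so the vector dimensions match
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")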
rds = Redis.from_existing_index(embeddings,
                                redis_url=redis_url,
                                index_name=index_name,
                                schema="redis_schema.yaml")

In [None]:
rds.add_documents(all_splits)