### Needed packages and imports

In [43]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0


In [44]:
import requests
import os
# The loaders, embeddings and vector stores live in langchain_community;
# the older langchain.* import paths are deprecated aliases.
from langchain_community.document_loaders import PyPDFDirectoryLoader, WebBaseLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters, the Milvus connection info

In [45]:
# Replace values according to your Milvus deployment
MILVUS_HOST = "vectordb-milvus.milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = "root"
MILVUS_PASSWORD = "Milvus"
MILVUS_COLLECTION = "pdf_collection"
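
The connection values above are hardcoded, including the password. As a minimal sketch, the same settings could instead be read from environment variables, falling back to the notebook's defaults. The helper name `load_milvus_settings` and the environment variable names are assumptions made for illustration; only the dictionary keys (`host`, `port`, `user`, `password`) match what the notebook later passes as `connection_args`.

```python
import os


def load_milvus_settings(env=None):
    """Build a connection dict from environment variables.

    The variable names (MILVUS_HOST, ...) are hypothetical; the fallback
    values are the ones hardcoded in this notebook.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("MILVUS_HOST", "vectordb-milvus.milvus.svc.cluster.local"),
        # Environment variables are strings, so the port is converted to int.
        "port": int(env.get("MILVUS_PORT", "19530")),
        "user": env.get("MILVUS_USERNAME", "root"),
        "password": env.get("MILVUS_PASSWORD", "Milvus"),
    }


settings = load_milvus_settings()
# Print everything except the password.
print({key: value for key, value in settings.items() if key != "password"})
```

The returned dictionary has the same shape as the `connection_args` argument used when the `Milvus` object is created, so it could be passed there directly.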

## Initial index creation and document ingestion

#### Load PDF and Markdown files

In [48]:
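pdf_folder_path = "./docs"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

# DirectoryLoader parses files with the `unstructured` package by default,
# which is not in the pip install above and may need to be installed separately.
md_loader = DirectoryLoader(pdf_folder_path, glob="**/*.md")
markdown_docs = md_loader.load()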


#### Combine the loaded documents

In [49]:
docs = pdf_docs + markdown_docs
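
The cell above only concatenates the two lists. If extra metadata is wanted on each document before ingestion, one possible approach is to derive it from the `source` entry that the loaders already set. The sketch below is an illustration under assumptions: `Doc` is a stand-in class that mirrors only the `page_content` and `metadata` attributes of a LangChain document, and the `file_type` key is a hypothetical name, not something the notebook defines.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same two attribute names)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_with_file_type(documents):
    """Add a hypothetical 'file_type' key based on each document's source path."""
    for doc in documents:
        source = doc.metadata.get("source", "")
        suffix = Path(source).suffix.lstrip(".").lower()
        doc.metadata["file_type"] = suffix or "unknown"
    return documents


sample = [
    Doc("pdf text", {"source": "docs/guide.pdf", "page": 0}),
    Doc("markdown text", {"source": "docs/notes.md"}),
]
for doc in tag_with_file_type(sample):
    print(doc.metadata)
```

Because the function only reads and writes `doc.metadata`, the same loop should work on the real `docs` list.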

#### Split documents into chunks with some overlap

In [50]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=128)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='Service Mesh Onboarding\nThis guide will walk through the first time setup for onboarding to OpenShift Service Mesh. For more information on Service Mesh, please\nsee the overview page.\nOpenShift Clusters that have the full Service Mesh deployed are known as the Gen2 OpenShift Clusters.\nIf you are a net-new app and have not yet onboarded to ArgoCD yet, please see CD Onboarding\nStep 1: Create New DNS Entries\nIn order to begin migration to Service Mesh, it is recommended you create new DNS entires specifically for the Gen2 clusters.\nIf your application already has DNS records for Gen1, please choose a different hostname for Gen2. After migrating to Gen2 fully, you can\nswitch the old Gen1 records to Gen2 if desired.Note\nCreate Unique DNS For Gen2\n6/25/25, 4:05 PM Service Mesh Onboarding - Platform Docs\nhttps://irma.ups.com/platform/guides/service-mesh/service-mesh-onboarding/ 1/21', metadata={'source': 'docs/Service Mesh Onboarding - Platform Docs.pdf', 'pa
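
To make the two parameters concrete, here is a simplified sketch of what `chunk_size` and `chunk_overlap` mean, using a plain fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraphs, lines and spaces, so its chunk boundaries will differ. The function name `sliding_chunks` is made up for this illustration.

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=128):
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


for chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the notebook's values (1024 and 128), each window in this simplified model starts 896 characters after the previous one.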

#### Create the index and ingest the documents

In [51]:
# Runs on CPU by default. To use a GPU, set model_kwargs = {'device': 'cuda'}
model_kwargs = {}
embeddings = HuggingFaceEmbeddings(
    model_kwargs=model_kwargs,
    show_progress=True
)

# BEWARE: `drop_old` is set to True, so if the collection already exists it will be deleted first.
db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

In [52]:
db.add_documents(all_splits)

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

[458982919043574969,
 458982919043574970,
 458982919043574971,
 458982919043574972,
 458982919043574973,
 458982919043574974,
 458982919043574975,
 458982919043574976,
 458982919043574977,
 458982919043574978,
 458982919043574979,
 458982919043574980,
 458982919043574981,
 458982919043574982,
 458982919043574983,
 458982919043574984,
 458982919043574985,
 458982919043574986,
 458982919043574987,
 458982919043574988,
 458982919043574989,
 458982919043574990,
 458982919043574991,
 458982919043574992,
 458982919043574993,
 458982919043574994,
 458982919043574995,
 458982919043574996,
 458982919043574997,
 458982919043574998,
 458982919043574999,
 458982919043575000,
 458982919043575001,
 458982919043575002,
 458982919043575003,
 458982919043575004,
 458982919043575005,
 458982919043575006,
 458982919043575007,
 458982919043575008,
 458982919043575009,
 458982919043575010,
 458982919043575011,
 458982919043575012,
 458982919043575013,
 458982919043575014,
 458982919043575015,
 458982919043

#### Alternatively, add new documents

In [None]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)

#### Test query

In [None]:
query = "Should I create new DNS?"
docs_with_score = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
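
The loop above prints every result regardless of its score. A small sketch of filtering the `(doc, score)` pairs follows, assuming the score is a distance where smaller means closer, as with the L2 metric. If the collection were built with an inner-product metric the comparison would flip. The helper name `keep_close_matches`, the threshold value and the sample data are hypothetical.

```python
def keep_close_matches(results, max_distance):
    """Keep (doc, score) pairs whose distance is at most max_distance,
    ordered from closest to farthest. Assumes lower score = better match."""
    kept = [(doc, score) for doc, score in results if score <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up pairs standing in for the output of similarity_search_with_score.
sample_results = [("chunk B", 0.93), ("chunk C", 1.7), ("chunk A", 0.4)]
for doc, score in keep_close_matches(sample_results, max_distance=1.0):
    print(score, doc)
```

A suitable threshold depends on the embedding model and the metric, so it is best chosen by inspecting scores for a few known-good and known-bad queries.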

In [53]:
query = "More about Openshift Users and Groups"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [54]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.9325615763664246
OpenShift Image Build Guidelines

To address vulnerabilities with the container engine, OpenShift will run containers using arbitrary user ids that belong to the root group, thus directories and files the application/process needs should belong to the root group. Also consider building images that provide compatibility when running on plain kubernetes and default to run as a non-root user id. S2I image builds typically take care of these concerns, but using multi-stage builds requires Dockerfile user permissions considerations. For example:

text USER 0 RUN chown -R 1001:0 /some/directory && \ chmod -R g=u /some/directory USER 1001

Containers cannot use privileged ports 1-1023 since these require root privileges to bind too

Review Adapting Docker and Kubernetes containers to run on Red Hat OpenShift Container Platform for other important considerations with building and running