# Ingest customer's knowledge base (PDF, webpages, etc.)

This notebook demonstrates how to ingest documents from the web, split them into smaller chunks, and store them in Milvus, a vector database, so they can be retrieved by similarity search.

Summary of Workflow:
 - Ingest Data:
    - Fetch documents from the specified URLs using WebBaseLoader.
    - Flatten the documents into a list.
 - Preprocess Data:
    - Split the documents into smaller chunks using RecursiveCharacterTextSplitter.
 - Store in Milvus:
    - Generate embeddings using HuggingFaceEmbeddings.
    - Store the vectorized chunks in a Milvus collection.
 - Query and Display Results:
    - Run a similarity search against the collection to retrieve document chunks and their scores.
    - Sort the results by score and display them.

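## Step 1 Installing required libraries
This command installs the necessary libraries:

 - requests: For making HTTP requests to load content from websites.
 - beautifulsoup4: For parsing and extracting content from HTML pages.
 - sentence-transformers: A library used for generating embeddings (vector representations of text).
 - pymilvus: A client for interacting with the Milvus vector database.
 - langchain-community: Provides the LangChain document loaders, the embeddings wrapper, and the Milvus vector store used below.
 - tiktoken: The tokenizer used by the text splitter to measure chunk sizes in tokens.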

In [1]:
!pip install requests beautifulsoup4 sentence-transformers pymilvus langchain-community tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 18.6 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 2 Imports

The imports include:
 - requests: To send HTTP requests.
 - os: To manage environment variables and paths.
 - LangChain modules:
   - PyPDFDirectoryLoader, WebBaseLoader: To load documents from a directory of PDFs or from websites. Only WebBaseLoader is used in this notebook.
   - RecursiveCharacterTextSplitter: To break long documents into smaller chunks.
   - HuggingFaceEmbeddings: To generate text embeddings using Hugging Face models.
 - Milvus: For storing and querying vectorized data.

In [2]:
import requests
import os
from langchain_community.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

## Step 3 Loading documents from the web

 - websites: Contains a list of URLs from which documents will be loaded.
 - WebBaseLoader: A LangChain loader that fetches content from a webpage.
 - docs: Calls WebBaseLoader(website).load() for each URL. Each call returns a list of documents, so docs is a list of lists.
 - docs_list: Flattens docs so that all documents are stored in a single list.

In [3]:
# Load and prepare documents
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# List of URLs to load documents from
websites = [
    "https://docs.appdynamics.com/appd/24.x/latest/en/splunk-appdynamics-essentials/getting-started",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/overview-of-application-monitoring",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/monitoring-settings",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/customize-agent-installer",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/secure-agent-installer-platform",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/upgrade-and-rollback-the-agents",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/deploy-an-agent-using-the-agent-installer-api",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/manage-tier-naming-rules",
    "https://docs.appdynamics.com/appd/24.x/latest/en/application-monitoring/install-app-server-agents/agent-installer/enable-server-visibility-and-net-compatibility-mode-with-agent-installer"
]
# Load documents from the URLs
docs = [WebBaseLoader(website).load() for website in websites]
docs_list = [item for sublist in docs for item in sublist]
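
With the list comprehension above, a single unreachable URL aborts the whole load. The following is a minimal sketch, not part of the original notebook, of loading each URL independently and flattening the results. The names `load_all` and `fake_loader` are hypothetical; in the notebook the loader function could be `lambda url: WebBaseLoader(url).load()`.

```python
from itertools import chain
from typing import Callable, Iterable


def load_all(urls: Iterable[str], load_fn: Callable[[str], list]) -> tuple[list, dict]:
    """Load every URL with load_fn and return (flat_docs, failures)."""
    per_url_docs = []
    failures = {}
    for url in urls:
        try:
            per_url_docs.append(load_fn(url))
        except Exception as exc:  # keep going if one page fails
            failures[url] = repr(exc)
    # Flatten the list of lists into one list, as docs_list does above
    flat_docs = list(chain.from_iterable(per_url_docs))
    return flat_docs, failures


# Stand-in loader used only for illustration (no network access)
def fake_loader(url: str) -> list:
    if "broken" in url:
        raise ValueError("could not fetch page")
    return [f"content of {url}"]


docs_flat, failed = load_all(
    ["https://example.com/a", "https://example.com/broken"], fake_loader
)
print(len(docs_flat), list(failed))
```

Collecting failures in a dictionary makes it possible to report or retry the affected URLs after the rest have been ingested.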

In [4]:
docs_list


[Document(page_content="\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGetting Started\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n×\n\n\n\n\n                                                            Topics\n                                                    \n\n\n\n\n\n\n\n\n\n\n\n\n            Home\n        \n\n\n\n            Community\n        \n\n\n\n            Cisco U.\n        \n\n\n\n            Glossary\n        \n\n\nEnglish\n\n\nJapanese\n\n\n\n\n25.4\n\n\n25.4 (latest)\n                                            \n\n\n25.3\n                                            \n\n\n25.2\n                                            \n\n\n25.1\n                                            \n\n\n24.12\n                                            \n\n\n24.11\n    
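
The output above shows that page_content contains long runs of blank lines and indentation left over from the page layout. One possible cleanup, sketched here under the assumption that this whitespace carries no meaning for retrieval, is to collapse it before splitting. The helper name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Trim every line, squeeze runs of spaces and tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# A short excerpt shaped like the loader output above
sample = "\n\n\n\nGetting Started\n\n\n\n            Home\n        \n\n\n            Community\n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

In the notebook, this could be applied to each document's page_content in docs_list before Step 4, so that chunks are not padded with empty lines.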

## Step 4 Splitting documents into chunks

Why Split Documents? Many machine learning models and databases have limits on input size. Breaking the documents into smaller, meaningful chunks allows for efficient processing and storage.

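 - RecursiveCharacterTextSplitter:
   - This splits text into chunks of at most chunk_size (500 here). Because the splitter is created with from_tiktoken_encoder, the size is measured in tiktoken tokens rather than characters.
   - It splits recursively on a list of separators (paragraph breaks, line breaks, then spaces), so chunks tend to end at natural boundaries instead of in the middle of a word.
   - chunk_overlap=50 lets consecutive chunks share up to 50 tokens, which preserves some context across chunk boundaries.
 - doc_splits: A list of smaller chunks derived from the original documents.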

In [5]:
# Split documents into chunks
# Initialize a text splitter with specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=50  # Larger chunks with some overlap for better context
)
# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)
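
To illustrate how chunk_size and chunk_overlap interact, here is a simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter is implemented (the real splitter also respects separators); the function `chunk_tokens` is hypothetical, and plain strings stand in for tiktoken tokens.

```python
def chunk_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Twelve placeholder tokens, split into windows of 5 with an overlap of 2
words = [f"w{i}" for i in range(12)]
chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so the last two tokens of one chunk reappear as the first two of the next.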

## Step 5 Preparing Milvus (Vector Database)
Configuration:
 - MILVUS_HOST and MILVUS_PORT: Define the host and port of the Milvus server.
 - MILVUS_USERNAME and MILVUS_PASSWORD: Optional authentication credentials, retrieved from environment variables using os.getenv.
 - MILVUS_COLLECTION: Specifies the name of the Milvus collection where vectors will be stored.

In [6]:
MILVUS_HOST = "vectordb-milvus.milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = "splunk_appdynamics"
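
The host and port are hard-coded above, while the credentials come from the environment. As a sketch of an alternative, the helper below reads all connection settings from environment variables and leaves out the credentials when they are not set. The function name, the `MILVUS_HOST` and `MILVUS_PORT` variable names, and the `localhost` default are assumptions made for illustration.

```python
import os


def milvus_connection_args(env=os.environ) -> dict:
    """Build a connection_args dictionary from environment variables."""
    args = {
        "host": env.get("MILVUS_HOST", "localhost"),  # default is an assumption
        "port": int(env.get("MILVUS_PORT", "19530")),
    }
    # Only include credentials when both values are set
    user, password = env.get("MILVUS_USERNAME"), env.get("MILVUS_PASSWORD")
    if user and password:
        args.update({"user": user, "password": password})
    return args


# A plain dict stands in for os.environ in this example
print(milvus_connection_args({"MILVUS_HOST": "vectordb-milvus.milvus.svc.cluster.local"}))
```

The returned dictionary has the same keys as the connection_args passed to Milvus in Step 6.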

## Step 6 Generating Text Embeddings
HuggingFaceEmbeddings:
 - Converts text into vector embeddings using a Hugging Face sentence-transformers model. No model_name is passed, so the wrapper's default model is used.
 - model_kwargs = {'device': 'cuda'} runs the model on the GPU for faster processing. If you don't have a GPU, use model_kwargs = {}.
 - show_progress=True: Displays a progress bar during embedding generation.

Milvus Instance:
 - connection_args: Connects to the Milvus database using the specified host, port, username, and password.
 - embedding_function: Specifies the embedding function (from HuggingFace).
 - collection_name: The name of the Milvus collection.
 - text_field and metadata_field: The collection fields that hold the chunk text and its metadata.
 - drop_old=True: Deletes the collection if it already exists and creates a new one (be careful with this in production!).
 - auto_id=True: Auto-generates IDs for stored documents.

In [7]:
# Storing document
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
model_kwargs = {'device': 'cuda'}
#model_kwargs = {}
embeddings = HuggingFaceEmbeddings(
    model_kwargs=model_kwargs,
    show_progress=True
)

# BEWARE: `drop_old` is set to True, so if the collection already exists it will be deleted first.
db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )
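
Instead of switching between the two model_kwargs lines by hand, the device can be chosen at runtime. This is a minimal sketch, assuming PyTorch is present because sentence-transformers depends on it; the helper name `pick_device` is hypothetical, and it falls back to the CPU when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # normally installed as a dependency of sentence-transformers
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


model_kwargs = {"device": pick_device()}
print(model_kwargs)
```

The resulting model_kwargs can be passed to HuggingFaceEmbeddings in the same way as the hard-coded dictionary above.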

## Step 7 Adding documents to Milvus
add_documents: Embeds the split text chunks (doc_splits) and adds them to Milvus as vectors. It returns the list of generated IDs.

In [8]:
db.add_documents(doc_splits)

Batches:   0%|          | 0/5 [00:00<?, ?it/s]

[457869862845663652,
 457869862845663653,
 457869862845663654,
 457869862845663655,
 457869862845663656,
 457869862845663657,
 457869862845663658,
 457869862845663659,
 457869862845663660,
 457869862845663661,
 457869862845663662,
 457869862845663663,
 457869862845663664,
 457869862845663665,
 457869862845663666,
 457869862845663667,
 457869862845663668,
 457869862845663669,
 457869862845663670,
 457869862845663671,
 457869862845663672,
 457869862845663673,
 457869862845663674,
 457869862845663675,
 457869862845663676,
 457869862845663677,
 457869862845663678,
 457869862845663679,
 457869862845663680,
 457869862845663681,
 457869862845663682,
 457869862845663683,
 457869862845663684,
 457869862845663685,
 457869862845663686,
 457869862845663687,
 457869862845663688,
 457869862845663689,
 457869862845663690,
 457869862845663691,
 457869862845663692,
 457869862845663693,
 457869862845663694,
 457869862845663695,
 457869862845663696,
 457869862845663697,
 457869862845663698,
 457869862845

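## Step 8 Querying Milvus
similarity_search_with_score: Embeds the query, searches the collection, and returns a list of (document chunk, score) pairs.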

In [13]:
query = "What is an agent?"
docs_with_score = db.similarity_search_with_score(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

## Step 9 Retrieving and sorting result
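 - docs_with_score: The result of the query in Step 8. Contains document chunks and their scores.

Sorting:
 - The list is sorted by score in descending order (reverse=True).
 - Assuming the collection uses a distance metric such as L2, a lower score means a closer match, so descending order lists the least similar chunks first. Use reverse=False to list the best matches first.

Printing Results:
 - The script prints the score and metadata of each document chunk. Printing the chunk content is commented out.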
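
The effect of the sort direction can be seen with a small worked example. This sketch assumes the scores are L2 distances; the two-dimensional vectors and the names `close`, `medium`, and `far` are made up for illustration and are not real embeddings.

```python
import math


def l2_distance(a: list, b: list) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query_vec = [1.0, 0.0]
# Hypothetical chunk embeddings, chosen only for illustration
chunk_vecs = {"close": [0.9, 0.1], "medium": [0.5, 0.5], "far": [-1.0, 0.0]}

scored = [(name, l2_distance(query_vec, vec)) for name, vec in chunk_vecs.items()]
# Ascending order puts the smallest distance, the best match, first
best_first = sorted(scored, key=lambda pair: pair[1])
for name, dist in best_first:
    print(name, round(dist, 3))
```

With a similarity metric such as inner product or cosine similarity, the direction flips and higher scores indicate better matches, so the sort order should follow the metric configured for the collection.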

In [16]:
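# Sort the docs_with_score list by score in descending order.
# Note: if the score is a distance (e.g. L2), lower is better; use reverse=False to list the best matches first.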
docs_with_score = sorted(docs_with_score, key=lambda x: x[1], reverse=True)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print("Metadata: ", doc.metadata)

    # Uncomment the lines below if you want to print the document content
    # print(doc.page_content)
    # print("-" * 80)

--------------------------------------------------------------------------------
Score:  1.765830636024475
Metadata:  {'source': 'https://www.transfermarkt.com/', 'title': 'Football transfers, rumours, market values and news | Transfermarkt', 'description': 'Transfermarkt - The football website for transfers, market values, rumours and stats', 'language': 'en'}
--------------------------------------------------------------------------------
Score:  1.7601606845855713
Metadata:  {'source': 'https://www.transfermarkt.com/', 'title': 'Football transfers, rumours, market values and news | Transfermarkt', 'description': 'Transfermarkt - The football website for transfers, market values, rumours and stats', 'language': 'en'}
--------------------------------------------------------------------------------
Score:  1.748345136642456
Metadata:  {'source': 'https://www.fabrizioromano.org/', 'title': 'Fabrizio Romano', 'description': 'highest paid footballers, top 10 richest football players, Cris