## Install required packages

In [138]:
!pip install qdrant-client langchain langchain_community pypdf openai lmstudio sentence-transformers duckduckgo-search --quiet




## Setup models
Here we use the LM Studio Python API.
The local LM Studio server supplies both the LLM and the embedding model.

In [384]:
import lmstudio as lms
# local model for embedding
embedding_model = lms.embedding_model("nomic-embed-text-v1.5")
# chat model
model = lms.llm()

from sentence_transformers import CrossEncoder
# rerank model
# rank_model = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")
rank_model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v1")


## We have to chunk the documents

In [385]:
# Loaders live in langchain_community; the splitter ships in langchain_text_splitters
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_pdf(file_path, chunk_size=10000, overlap=500):
    loader = PyPDFLoader(file_path)
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return splitter.split_documents(docs)
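
To make the `chunk_size` and `overlap` parameters concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers natural boundaries such as paragraphs and lines, while the hypothetical `split_with_overlap` helper below cuts at fixed character offsets.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The shared characters at each boundary are what keep a sentence that straddles two chunks retrievable from either one.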

## Let's use a local Qdrant vector store
For installation, refer to:
https://hub.docker.com/r/qdrant/qdrant

In [386]:
from qdrant_client import QdrantClient, models
import uuid  # generates unique point IDs
from tqdm import tqdm  # progress bar for the embedding loop

client = QdrantClient("localhost", port=6333)

def pdf2rag_store(pdf_file, collection_name=None, batch_size=100):
    if collection_name is None:
        collection_name = pdf_file

    # Generate the chunks from the PDF file
    chunks = chunk_pdf(pdf_file)
    points = []

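    # Nothing to store if the PDF produced no text (e.g. a scanned, image-only file)
    if not chunks:
        print(f"No text chunks extracted from '{pdf_file}'; nothing stored.")
        return

    # Embed the first chunk to determine the vector size for the collection
    first_embedding = embedding_model.embed(chunks[0].page_content)
    vector_size = len(first_embedding)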

    # Check if the collection exists, and create it if not.
    # collection_exists avoids treating unrelated errors (e.g. a refused connection) as "missing".
    if client.collection_exists(collection_name=collection_name):
        print(f"Collection '{collection_name}' already exists.")
    else:
        print(f"Collection '{collection_name}' does not exist. Creating...")
        
        # Specify the vector configuration (e.g., vector size and distance metric)
        vectors_config = models.VectorParams(
            size=vector_size,
            distance=models.Distance.COSINE  # You can change this to another distance metric if needed
        )
        
        # Create the collection with the vector config
        client.create_collection(
            collection_name=collection_name,
            vectors_config=vectors_config
        )
        print(f"Collection '{collection_name}' created.")

    # Create points for each chunk with tqdm progress bar
    for i, chunk in tqdm(enumerate(chunks), total=len(chunks), desc="Processing Chunks"):
        # Get the embedding for the chunk
        # Optionally, you can summarize the chunk with the LLM and embed the summary instead
        embedding = embedding_model.embed(chunk.page_content)

        # Use a UUID for the point ID instead of a string index
        point_id = str(uuid.uuid4())  # Generate a UUID for each point

        # Create a PointStruct and append it to the points list
        points.append(models.PointStruct(
            id=point_id,  # Use UUID as ID
            vector=embedding,
            payload={"text": chunk.page_content}
        ))

        # If we've reached the batch size, upsert and clear the points list
        if len(points) >= batch_size:
            client.upsert(
                collection_name=collection_name,
                points=points
            )
            print(f"Stored {len(points)} points in collection '{collection_name}'")
            points = []  # Clear the list for the next batch

    # Insert any remaining points if they exist
    if points:
        client.upsert(
            collection_name=collection_name,
            points=points
        )
        print(f"Stored {len(points)} remaining points in collection '{collection_name}'")
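
The batching logic inside `pdf2rag_store` (fill a list, flush it at `batch_size`, flush the remainder at the end) can be sketched in isolation. The `batched` generator below is a hypothetical helper written for illustration; it is not part of the notebook or of `qdrant-client`.

```python
from typing import Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of at most `batch_size` items, including a final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []  # start a fresh list for the next batch
    if batch:
        yield batch  # leftover items that did not fill a whole batch


sizes = [len(b) for b in batched(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

With such a helper, each yielded batch would correspond to one `client.upsert` call, which keeps individual requests small for large PDFs.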



## Chunk and store PDF data sources in the vector DB

In [521]:
# https://bjpcjp.github.io/pdfs/devops/linux-commands-handbook.pdf
pdf_file = '~/Downloads/linux-commands-handbook.pdf'
pdf2rag_store(pdf_file, "linux-commands-handbook")

In [465]:
# https://www.polygwalior.ac.in/file/20181115101103600592.pdf
pdf_file = '~/Downloads/dos_commands.pdf'
pdf2rag_store(pdf_file, "dos_commands")

Collection 'dos_commands' does not exist. Creating...
Collection 'dos_commands' created.


Processing Chunks: 100%|██████████| 3/3 [00:00<00:00, 91.15it/s]

Stored 3 remaining points in collection 'dos_commands'





In [520]:
# http://ufdcimages.uflib.ufl.edu/AA/00/01/16/99/00001/WorldHistory.pdf
pdf_file = '~/Downloads/WorldHistory.pdf'
pdf2rag_store(pdf_file, "world-history")

In [522]:
# https://www.uhd.edu/documents/provost/us-history.pdf
pdf_file = '~/Downloads/us-history.pdf'
pdf2rag_store(pdf_file, "us-history")

In [256]:
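def search(collection_name, query_text, top_k=5):
    # Retrieve the top_k points whose vectors are closest to the query embedding.
    # query_points replaces the deprecated client.search in recent qdrant-client releases.
    response = client.query_points(
        collection_name=collection_name,
        query=embedding_model.embed(query_text),
        limit=top_k
    )
    return response.points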
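
Conceptually, the search ranks stored vectors by cosine similarity to the query embedding, since the collections above are created with `Distance.COSINE`. The sketch below shows that ranking by brute force over a toy in-memory store; the three-dimensional vectors and the `top_k_by_cosine` helper are made up for illustration, and Qdrant itself answers the query through its own index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vector, stored, top_k=5):
    """Return (score, text) pairs for the top_k most similar stored vectors."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for this example only
toy_store = {
    "ls lists files": [1.0, 0.0, 0.0],
    "dir lists files": [0.9, 0.1, 0.0],
    "Stephen I was crowned": [0.0, 0.0, 1.0],
}
hits = top_k_by_cosine([1.0, 0.0, 0.0], toy_store, top_k=2)
print([text for _, text in hits])  # ['ls lists files', 'dir lists files']
```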

## Add web search capabilities

In [478]:
from duckduckgo_search import DDGS

def web_search(query, top_k=10):
    results = []
    try:
        with DDGS() as ddgs:
            search_results = ddgs.text(query, max_results=top_k)
            results = [
                f">>>>SOURCE WEB<<<: {r['title']} - {r['href']}\n{r['body']}\n\n"
                for r in search_results
            ]
    except Exception as e:
        print(f"Web search failed: {e}")
    return results

## Context Search and Combination in the `rag` Function

The `rag` function gathers context from two sources and then combines them:

### 1. **Vector Search with Qdrant**
- It queries multiple collections in a vector database using the input query.
- Each collection contains document embeddings, and the function retrieves the `top_k` most relevant results.
- The results are labeled with their source (purely for visibility in this demo) using the format `>>>>SOURCE QDRANT/{collection_name}<<<`, which makes it easy to see where each passage originated.

### 2. **Web Search**
- The function performs a web search using the query to fetch up-to-date information.
- Web search results are particularly useful for topics with recent updates or dynamic information.
- Web search results are already prefixed with `>>>>SOURCE WEB<<<`.

### 3. **Combining Results**
- Both the vector search results and the web search results are combined into a single list called `text_list`.
- This combined context ensures the model has access to a diverse and relevant set of information.
- Afterward, a reranking model selects the `rerank_top_k` most relevant documents from `text_list`.
- The final set of reranked documents is used as context to generate an accurate and informed response.
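
The combine-then-rerank flow described above can be sketched without any models. In this sketch a toy word-overlap score stands in for the cross-encoder, so the ranking quality is not representative; `overlap_score`, `combine_and_rerank`, and the sample passages are assumptions made for illustration.

```python
def overlap_score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def combine_and_rerank(query, vector_hits, web_hits, rerank_top_k=10, score=overlap_score):
    """Merge both result lists, then keep the rerank_top_k highest-scoring passages."""
    text_list = list(vector_hits) + list(web_hits)
    ranked = sorted(text_list, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:rerank_top_k]


# Illustrative passages, labeled the same way as in the notebook
vector_hits = [
    ">>>>SOURCE QDRANT/linux-commands-handbook<<<: umask with no arguments prints the current mask",
    ">>>>SOURCE QDRANT/world-history<<<: The kingdom was established in 1000 AD",
]
web_hits = [
    ">>>>SOURCE WEB<<<: umask Cheat Sheet - without arguments umask displays the mask",
]
context = combine_and_rerank("umask no arguments", vector_hits, web_hits, rerank_top_k=2)
for passage in context:
    print(passage)
```

Retrieving generously (`top_k`) and trimming late (`rerank_top_k`) lets a cheap first stage cast a wide net while the more expensive scorer decides what actually reaches the prompt.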



In [482]:
def rag(query, top_k=40, rerank_top_k=10):
    collections = client.get_collections()
    text_list = [
        f">>>>SOURCE QDRANT/{c.name}<<<: {i.payload['text']}\n\n"
        for c in collections.collections
        for i in search(c.name, query, top_k=top_k)
    ]
    # Perform a web search
    web_results = web_search(query, top_k=top_k)
    print(f'Web results: {len(web_results)}')

    # Combine results
    text_list.extend(web_results)
    print(f'Total text_list: {len(text_list)}')
    
    rerank_results = rank_model.rank(query, text_list, return_documents=True, top_k=rerank_top_k)
    concatenated_text = "\n".join(i['text'] for i in rerank_results)
    print(f'context: {len(rerank_results)}')
    prompt = f"""
    {query}
    
    Provide a clear and well-structured answer using **Markdown formatting**.
    Only use the provided context
    
    ### Context:
    {concatenated_text}
    """

    return model.respond(prompt), concatenated_text, text_list


In [483]:
from IPython.display import display, Markdown

def rag_formatted(query):
    res, ctx, text_list = rag(query)
    display(Markdown(res.content))
    return ctx, text_list


In [516]:
ctx, pre_reranked_ctx = rag_formatted("When was Hungary founded?")



Web results: 40
Total text_list: 163
context: 10


Hungary was founded in 895 AD by Árpád, the leader of the Hungarian tribes. The Kingdom of Hungary was established in 1000 AD when Stephen I was crowned as its first king.

In [518]:
ctx

'>>>>SOURCE WEB<<<: 30 Facts About Hungary - OhMyFacts - https://ohmyfacts.com/world/countries/30-facts-about-hungary/\nHungary was founded in 895 AD by Árpád, the leader of the Hungarian tribes. The Kingdom of Hungary was established in 1000 AD when Stephen I was crowned as its first king. Hungary was part of the Austro-Hungarian Empire from 1867 until its dissolution in 1918. The country has been a member of the European Union since 2004. Cultural Heritage\n\n\n>>>>SOURCE WEB<<<: When was Hungary founded? - Answers - https://www.answers.com/travel-destinations/When_was_Hungary_founded\nHungary was founded in 896.\n\n\n>>>>SOURCE WEB<<<: Brief History of Hungary - English - We love Budapest - https://welovebudapest.com/en/article/2011/02/14/brief-history-of-hungary\nIt was founded in 895 and became a Christian kingdom in 1000 by the crowning of St. Stephan, recognized by the pope. ... of Stephan and his descendants was the stabilization of Christianity and to Europeanize the previousl

In [517]:
# pre_reranked_ctx

In [514]:
ctx, pre_reranked_ctx = rag_formatted("What does umask command do with no arguments?")



Web results: 40
Total text_list: 163
context: 10


When using the `umask` command without any arguments, it displays the current mask value in octal form.

Here's a summary:

* The `umask` command sets a mask that restricts default permissions.
* Without any arguments, `umask` displays the current user mask in octal form.
* Running `umask` by itself provides the default permissions that will be applied to newly created files and directories.
* The output of `umask` without arguments shows the permission bits that will NOT be set on the newly created files and directories.

### Note: 
The reranked context contains relevant information from the web as well as from the documents embedded in Qdrant.
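
One way to check where the reranked passages came from is to count the source labels in the context string. The sketch below assumes the `>>>>SOURCE ...<<<` label format used in this notebook; `count_sources` is a hypothetical helper and the sample string is invented, not real output.

```python
import re
from collections import Counter

# Matches ">>>>SOURCE WEB<<<" and ">>>>SOURCE QDRANT/<collection><<<"
SOURCE_LABEL = re.compile(r">>>>SOURCE (WEB|QDRANT/[^<]+)<<<")


def count_sources(context: str) -> Counter:
    """Count how many passages in the context carry each source label."""
    return Counter(SOURCE_LABEL.findall(context))


# Invented sample in the notebook's label format
sample_ctx = (
    ">>>>SOURCE WEB<<<: What is Umask in Linux - ...\n\n"
    ">>>>SOURCE QDRANT/linux-commands-handbook<<<: umask ...\n\n"
    ">>>>SOURCE WEB<<<: umask Cheat Sheet - ...\n\n"
)
counts = count_sources(sample_ctx)
print(counts)
```

Calling `count_sources(ctx)` on the real context would show how the ten reranked passages split between the web and each Qdrant collection.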

In [515]:
ctx

'>>>>SOURCE WEB<<<: What is Umask in Linux and how to use it effectively? - https://www.rosehosting.com/blog/what-is-umask-in-linux/\nThe bits in the umask command can be changed by invoking the umask command. The syntax of the umask command is the following one: umask [OPTION]... [MODE] Executing this command without arguments or options will return the current value. Let\'s implement it: umask. You should get output with bits like this: root@host:~# umask 0022\n\n\n>>>>SOURCE WEB<<<: umask Cheat Sheet - umask Command Line Guide - https://www.commandinline.com/cheat-sheet/umask/\nThe umask command sets a mask that restricts these default permissions. Basic Syntax: umask [MASK] [MASK]: The permission mask to apply (as an octal value). Without any arguments, umask displays the current mask. How umask Works. Permissions for files: Files cannot have execute permissions by default.\n\n\n>>>>SOURCE WEB<<<: What is umask command for? - Unix & Linux Stack Exchange - https://unix.stackexchange

In [505]:
# pre_reranked_ctx