In [6]:
!pip install langchain faiss-cpu chromadb sentence-transformers
!pip install ollama


Collecting langchain
  Downloading langchain-0.3.24-py3-none-any.whl.metadata (7.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp312-cp312-macosx_14_0_arm64.whl.metadata (4.8 kB)
Collecting chromadb
  Downloading chromadb-1.0.7-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.9 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting langchain-core<1.0.0,>=0.3.55 (from langchain)
  Downloading langchain_core-0.3.56-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Downloading langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Collecting langsmith<0.4,>=0.1.17 (from langchain)
  Downloading langsmith-0.3.37-py3-none-any.whl.metadata (15 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp312-cp312-macos

In [8]:
!pip install langchain chromadb sentence-transformers ollama


Collecting langchain
  Using cached langchain-0.3.24-py3-none-any.whl.metadata (7.8 kB)
Collecting chromadb
  Using cached chromadb-1.0.7-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.9 kB)
Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting ollama
  Downloading ollama-0.4.8-py3-none-any.whl.metadata (4.7 kB)
Collecting langchain-core<1.0.0,>=0.3.55 (from langchain)
  Using cached langchain_core-0.3.56-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Using cached langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Collecting langsmith<0.4,>=0.1.17 (from langchain)
  Using cached langsmith-0.3.37-py3-none-any.whl.metadata (15 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Using cached chroma_hnswlib-0.7.6-cp312-cp312-macosx_11_0_arm64.whl.

In [51]:
!pip list | grep langchain
!pip install --upgrade langchain-ollama langchain langchain-community
!pip install -U langchain-huggingface


langchain                                0.3.24
langchain-community                      0.3.22
langchain-core                           0.3.56
langchain-ollama                         0.3.2
langchain-text-splitters                 0.3.8
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_huggingface-0.1.2-py3-none-any.whl (21 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.1.2


In [3]:
import os
import sys
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings

# Debug info
print(f"Python version: {sys.version}")
try:
    import langchain
    print(f"LangChain version: {langchain.__version__}")
except Exception as e:
    print("Could not determine LangChain version")

# Vector store path
chroma_db_path = "./chroma_db_hfs"

# Re-embed check
if os.path.exists(chroma_db_path):
    while True:
        user_input = input(
            f"\nChroma vector store found at '{chroma_db_path}'.\n"
            "Do you want to re-embed the documents? (yes/no): "
        ).strip().lower()
        if user_input in ["yes", "y"]:
            re_embed = True
            break
        elif user_input in ["no", "n"]:
            re_embed = False
            break
        else:
            print("Please answer 'yes' or 'no'.")
else:
    re_embed = True

# 1. Load multiple PDFs
try:
    pdf_files = [
        'CAP Theorem.pdf',
        'Key Essentials for Building Application in Cloud.pdf'
    ]

    documents = []
    for pdf_file in pdf_files:
        if os.path.exists(pdf_file):
            loader = PyPDFLoader(pdf_file)
            documents.extend(loader.load())
        else:
            print(f"Warning: File not found: {pdf_file}")

    if not documents:
        raise ValueError("No documents loaded. Check PDF file paths.")

    print(f"Documents loaded: {len(documents)} pages")
except Exception as e:
    print(f"Error loading PDFs: {e}")
    raise

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
print(f"Document split into {len(docs)} chunks")

# 3. Create embeddings
try:
    print("Initializing HuggingFace embeddings...")
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
except Exception as e:
    print(f"Error initializing embeddings: {e}")
    raise

# 4. Vector store
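if re_embed:
    # Remove any existing store first. Chroma.from_documents adds to a persisted
    # collection rather than replacing it, which is the likely reason a re-embed
    # reports twice as many vectors as chunks.
    import shutil
    if os.path.exists(chroma_db_path):
        shutil.rmtree(chroma_db_path)
    print("Creating new vector store...")
    db = Chroma.from_documents(
        docs,
        embedding=embeddings,
        persist_directory=chroma_db_path
    )
    # No explicit db.persist() call: with chromadb 0.4+ a store created with
    # persist_directory is written to disk automatically.
    print("Vector store persisted.")
    print(f"Vector store contains {db._collection.count()} embedded documents")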
else:
    print("Loading existing Chroma vector store...")
    db = Chroma(persist_directory=chroma_db_path, embedding_function=embeddings)

# 5. Retriever
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# 6. LLM
print("Initializing Ollama LLM...")
llm = Ollama(model="llama3.2:latest", temperature=0.1)

# 7. RetrievalQA Chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 8. Ask a question
query = "What is containing this documents"
print(f"\nQuery: {query}")
print("Generating answer...")

try:
    result = qa.invoke({"query": query})
    print("\nAnswer:", result["result"])
    print("\nSource documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content[:150]}...")
        print(f"Source: Page {doc.metadata.get('page', 'unknown')}")
except Exception as e:
    # No qa.run() fallback here: run() only supports chains with exactly one
    # output key, and this chain also returns "source_documents".
    print(f"Error in query processing: {e}")


Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:22:19) [Clang 14.0.6 ]
LangChain version: 0.3.24



Chroma vector store found at './chroma_db_hfs'.
Do you want to re-embed the documents? (yes/no):  yes


Documents loaded: 63 pages
Document split into 61 chunks
Initializing HuggingFace embeddings...
Creating new vector store...
Vector store persisted.
Vector store contains 122 embedded documents
Initializing Ollama LLM...

Query: What is containing this documents
Generating answer...

Answer: I don't know.

Source documents:

Document 1:
Content: Concepts
Confidentiality
• Accessible only to authorized parties
• Within cloud environments, confidentiality targets to restricting access to
data in...
Source: Page 12

Document 2:
Content: Concepts
Confidentiality
• Accessible only to authorized parties
• Within cloud environments, confidentiality targets to restricting access to
data in...
Source: Page 12

Document 3:
Content: Public Key Infrastructure (PKI)
• Used to associate public keys with their corresponding key owners
• Rely on the use of digital certificates, which a...
Source: Page 32

Document 4:
Content: Public Key Infrastructure (PKI)
• Used to associate public keys with their c
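
The run above reports 61 chunks but 122 stored vectors, and the retrieved sources arrive in identical pairs, which is consistent with the same chunks having been added to the collection twice. A minimal sketch of one guard against that, assuming each chunk can be reduced to its text, source file, and page: derive a deterministic ID from those fields and drop repeats before indexing. The names `stable_chunk_id` and `dedupe_chunks` are hypothetical helpers, not part of LangChain or Chroma.

```python
import hashlib


def stable_chunk_id(text, source, page):
    """Return a deterministic ID for a chunk (hypothetical helper)."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def dedupe_chunks(chunks):
    """Drop chunks whose (source, page, text) triple was already seen.

    `chunks` is a list of dicts with "text", "source" and "page" keys.
    Returns (ids, unique_chunks) in first-seen order.
    """
    seen = {}
    for chunk in chunks:
        cid = stable_chunk_id(chunk["text"], chunk["source"], chunk["page"])
        seen.setdefault(cid, chunk)  # keep only the first occurrence
    return list(seen.keys()), list(seen.values())


# Illustrative data only; the real chunks come from the text splitter.
sample = [
    {"text": "Confidentiality", "source": "a.pdf", "page": 12},
    {"text": "Confidentiality", "source": "a.pdf", "page": 12},
    {"text": "Public Key Infrastructure (PKI)", "source": "a.pdf", "page": 32},
]
ids, unique = dedupe_chunks(sample)
print(f"{len(sample)} chunks in, {len(unique)} unique chunks out")
```

IDs built this way could then be supplied alongside the chunks when the store is created (the LangChain Chroma wrapper accepts an `ids` argument, as far as I know), so that re-running the cell overwrites entries instead of appending copies.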

In [None]:
# NOTE: this prompt is defined but never passed to the chain, so `qa` below
# still answers with the default RetrievalQA prompt.
system_prompt = "You are an AI that provides detailed, clear, and concise answers based on the document provided. Use the document content to answer questions and provide references where applicable."

# Start the continuous question loop
print("\n=== QA System Ready ===")
print("Type your questions about the document (type 'stop' to exit)\n")

while True:
    # Accept user input (query)
    query = input("Your question: ").strip()

    if query.lower() == "stop":
        print("Exiting the QA system...")
        break
    
    if not query:
        print("Please enter a question or 'stop' to exit")
        continue

    # Echo the query, then run it through the RetrievalQA chain
    print(f"\nQuery: {query}")
    print("Generating answer...")

    try:
        # Pass the query to the QA system
        result = qa.invoke({"query": query})
        
        # Display the answer and source documents
        print("\nAnswer:", result["result"])
        print("\nSource documents:")
        for i, doc in enumerate(result["source_documents"]):
            print(f"\nDocument {i+1}:")
            print(f"Content: {doc.page_content[:150]}...")
            print(f"Source: Page {doc.metadata.get('page', 'unknown')}")
        
    except Exception as e:
        # No qa.run() fallback here: run() only supports chains with exactly
        # one output key, and this chain also returns "source_documents".
        print(f"Error in query processing: {e}")



=== QA System Ready ===
Type your questions about the document (type 'stop' to exit)



Your question:  what is the meaning of this book



Query: what is the meaning of this book
Generating answer...

Answer: I don't know the specific meaning of this book, as the provided text appears to be an excerpt from Paulo Coelho's "The Alchemist" and doesn't provide a clear summary or explanation of the book's themes or message. The text seems to be more focused on Coelho's writing process and his philosophy on self-discovery and listening to one's inner voice.

Source documents:

Document 1:
Content: This	is	my	favorite	section	of	the	book.	If	you	can	enrich	your	self-confidence,
passion,	and	connection	with	people,	your	life	will	transform	in	ways...
Source: Page 199

Document 2:
Content: This	is	my	favorite	section	of	the	book.	If	you	can	enrich	your	self-confidence,
passion,	and	connection	with	people,	your	life	will	transform	in	ways...
Source: Page 199

Document 3:
Content: sell—three	thousand,	then	six	thousand,	ten	thousand—book	by	book,	gradually	throughout	the	year.”
The	 book	 became	 an	 organic	 phenomenon	 and	 th...

Your question:  what are the key topics in this book



Query: what are the key topics in this book
Generating answer...

Answer: Based on the provided context, it appears that the book is about self-discovery, personal growth, and finding one's purpose. The key topics seem to include:

1. Self-confidence and passion
2. Connection with people and building relationships
3. Listening to one's inner voice and intuition (as expressed in the quote "The answers are inside of you if you have the courage to listen.")
4. Embracing individuality and uniqueness
5. Courage and taking risks to pursue one's dreams and passions

These topics seem to be central to the book, but without more information or context, it is difficult to provide a more definitive answer.

Source documents:

Document 1:
Content: This	is	my	favorite	section	of	the	book.	If	you	can	enrich	your	self-confidence,
passion,	and	connection	with	people,	your	life	will	transform	in	ways...
Source: Page 199

Document 2:
Content: This	is	my	favorite	section	of	the	book.	If	you	can	enrich	y

Your question:  how it explains about self confident



Query: how it explains about self confident
Generating answer...

Answer: The text doesn't explicitly explain what "the Rule" is, but it does provide examples and quotes from various authors that suggest it's a tool or principle for building confidence through action. The text mentions that doing things that scare you can make you more confident, acting on your fears, and taking action despite feeling nervous or afraid can help build confidence.

The text also quotes Timothy Wilson, who writes about the "do good, be good" intervention, which suggests that changing behavior first can lead to changes in self-perception. This implies that "the Rule" may involve making deliberate choices to take action, express oneself, and push past fears or doubts to build confidence.

However, without more information about what "the Rule" specifically is, it's difficult to provide a more detailed explanation of how it explains self-confidence.

Source documents:

Document 1:
Content: As	Michelle	disco
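
The source listings above show only a page number, which is ambiguous once more than one PDF is indexed. A small sketch of a formatter follows, assuming each chunk's metadata carries a `source` path and a zero-based `page` index (which is how `PyPDFLoader` usually populates it). `format_source` is a hypothetical helper name.

```python
from pathlib import Path


def format_source(metadata):
    """Build a readable 'file, page N' label from chunk metadata.

    Assumes `page` is zero-based, so 1 is added for display.
    """
    source = metadata.get("source")
    page = metadata.get("page")
    name = Path(source).name if source else "unknown file"
    if isinstance(page, int):
        return f"{name}, page {page + 1}"
    return f"{name}, page unknown"


# Illustrative metadata only.
print(format_source({"source": "./CAP Theorem.pdf", "page": 11}))
print(format_source({}))
```

In the loops above, this would stand in for the `doc.metadata.get('page', 'unknown')` expression, called as `format_source(doc.metadata)`.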

In [25]:
# %% [markdown]
# ## 1. Ingest PDFs, Embed & Build RetrievalQA (with forced-rebuild support)

# %%
import sys
import logging
from pathlib import Path
from shutil import rmtree
from tqdm.auto import tqdm

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# ——— Logging & Debug Info ———
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logging.info(f"Python    : {sys.version.split()[0]}")
try:
    import langchain
    logging.info(f"LangChain : {langchain.__version__}")
except ImportError:
    logging.warning("LangChain not installed or version unknown")

# ——— Configuration ———
PDF_DIR          = Path("./")
CHROMA_DIR       = Path("./chroma_db_sha")
CHUNK_SIZE       = 200
CHUNK_OVERLAP    = 50
EMBED_MODEL_NAME = "all-MiniLM-L6-v2"
LLM_MODEL        = "deepseek-r1:1.5b-qwen-distill-q4_K_M"
RETRIEVE_K       = 4

# %% 
def ingest_and_build(rerun: bool = False):
    """
    1) Loads all PDFs in PDF_DIR
    2) Splits into chunks
    3) Embeds with HuggingFace
    4) Persists or reloads Chroma vector store,
       optionally wiping it on rerun=True
    5) Returns a RetrievalQA chain ready for use
    """
    # 1) Find all PDF files
    pdf_files = sorted(PDF_DIR.glob("*.pdf"))
    if not pdf_files:
        raise FileNotFoundError(f"No PDFs found in {PDF_DIR.resolve()!s}")
    logging.info(f"Found {len(pdf_files)} PDF(s) to load.")

    # 2) Load pages
    documents = []
    for pdf in tqdm(pdf_files, desc="Loading PDFs"):
        loader = PyPDFLoader(str(pdf))
        documents.extend(loader.load())
    logging.info(f"Total pages loaded: {len(documents)}")

    # 3) Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    chunks = splitter.split_documents(documents)
    logging.info(f"Split into {len(chunks)} chunks (size={CHUNK_SIZE}, overlap={CHUNK_OVERLAP})")

    # 4) Embed + Chroma
    embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL_NAME)

    # — if rerun, delete any existing DB so we start fresh
    if rerun and CHROMA_DIR.exists():
        logging.info(f"Removing old Chroma directory at {CHROMA_DIR!s}")
        rmtree(CHROMA_DIR)

    if rerun or not CHROMA_DIR.exists():
        logging.info("Creating new Chroma store & embedding…")
        db = Chroma.from_documents(
            chunks,
            embedding=embeddings,
            persist_directory=str(CHROMA_DIR),
        )
        # No explicit db.persist() call: with chromadb 0.4+ the store is
        # written to persist_directory automatically.
    else:
        logging.info("Loading existing Chroma store.")
        db = Chroma(
            persist_directory=str(CHROMA_DIR),
            embedding_function=embeddings
        )

    # count may be approximate
    try:
        count = db._collection.count()
    except Exception:
        count = "unknown"
    logging.info(f"Chroma contains ~{count} vectors")

    # 5) Build RetrievalQA
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": RETRIEVE_K})
    llm       = Ollama(model=LLM_MODEL, temperature=0.1)
    qa_chain  = RetrievalQA.from_chain_type(
        llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
    )
    logging.info("RetrievalQA chain ready.")
    return qa_chain

# %%
# Usage example:
# Set rerun=True to force deletion of old DB and full re-embed.
qa = ingest_and_build(rerun=True)


2025-05-09 15:22:22,817 INFO Python    : 3.12.7
2025-05-09 15:22:22,818 INFO LangChain : 0.3.24
2025-05-09 15:22:22,820 INFO Found 2 PDF(s) to load.



2025-05-09 15:22:23,149 INFO Total pages loaded: 63
2025-05-09 15:22:23,151 INFO Split into 125 chunks (size=200, overlap=50)
2025-05-09 15:22:23,153 INFO Use pytorch device_name: mps
2025-05-09 15:22:23,153 INFO Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2025-05-09 15:22:26,650 INFO Creating new Chroma store & embedding…
2025-05-09 15:22:26,651 INFO Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2025-05-09 15:22:27,175 INFO Chroma contains ~125 vectors
2025-05-09 15:22:27,177 INFO RetrievalQA chain ready.
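
To make the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences, so its chunk counts will differ); it only illustrates how the two numbers interact. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so each new
    window starts (chunk_size - chunk_overlap) characters after the last.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With the settings used above (size 200, overlap 50), each window in this simplified scheme would advance by 150 characters, so smaller chunks mean more vectors and more focused retrieval at the cost of less context per chunk.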


In [None]:
# %% [markdown]
# ## 2. Interactive Q&A (with Spinner + Clear “Thinking” State)

# %%
import html
import time
from IPython.display import display, Markdown, HTML, clear_output
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# ——— 1) Build the LLM Instance ———
llm = Ollama(model="deepseek-r1:1.5b-qwen-distill-q4_K_M", temperature=0.1)

# ——— 2) Define System + User Prompt ———
system_message = SystemMessagePromptTemplate.from_template(
    "You are a helpful AI assistant. Use ONLY the provided context to answer the user's question. "
    "If the answer is not in the context, say \"I don't know.\" Do not hallucinate."
)
user_message = HumanMessagePromptTemplate.from_template(
    """
Context:
---------------------
{context}
---------------------

Question: {question}
Answer:
"""
)
chat_prompt = ChatPromptTemplate.from_messages([system_message, user_message])

# ——— 3) (Re)build RetrievalQA chain with our chat prompt ———
# Reuse the retriever from the existing `qa` chain and replace the chain itself:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=qa.retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": chat_prompt},
)

# ——— 4) Spinner HTML ———
spinner_html = """
<div style="display:flex;align-items:center">
  <div class="loader" style="
      border: 8px solid #f3f3f3;
      border-top: 8px solid #3498db;
      border-radius: 50%;
      width: 40px;
      height: 40px;
      animation: spin 1s linear infinite;
      margin-right:10px;
    "></div>
  <div><b>Thinking…</b></div>
</div>
<style>
@keyframes spin {
  0%   { transform: rotate(0deg); }
  100% { transform: rotate(360deg); }
}
</style>
"""

# ——— 5) Q&A helper with spinner ———
def answer_question(qa_chain, question: str):
    clear_output(wait=True)
    display(Markdown(f"**Q:** {html.escape(question)}\n"))
    # show spinner
    handle = display(HTML(spinner_html), display_id="spinner")
    # call LLM
    resp = qa_chain.invoke({"query": question})
    # remove spinner
    handle.update(HTML(""))
    # display answer
    ans  = resp["result"]
    srcs = resp.get("source_documents", [])
    display(Markdown(f"---\n## Answer\n{html.escape(ans)}\n"))
    # display sources
    if srcs:
        display(Markdown("### Sources"))
        for i, doc in enumerate(srcs, 1):
            page    = doc.metadata.get("page", "unknown")
            snippet = html.escape(doc.page_content[:200]).replace("\n", " ")
            display(Markdown(f"- **Doc {i}** (page {page}): “{snippet}…”"))

# %%
print("Type questions below (or 'stop' to exit):")
while True:
    query = input("▶ ").strip()
    if query.lower() in ("stop", "exit", "quit"):
        clear_output(wait=True)
        print("Goodbye! 👋")
        break
    if not query:
        continue
    answer_question(qa, query)


**Q:** can you breif the content in this documents and give me the answer sn point format


---
## Answer
<think>
Okay, so I need to figure out how to respond to this user's question. They provided a context about cybersecurity and asked for a brief summary with points. Let me break it down.

First, the context mentions that the AI should aim to compromise data confidentiality but note that the attack is passive and can happen undetected for long periods. That's important because it highlights the need for secure measures despite the potential risks.

Next, it talks about how data storage, processing, and retrieval are affected by these factors. So, I should include points on how data integrity is maintained during these processes to prevent unauthorized access.

Then, there's a section on CSP services and geographical locations like AWS EC2 in specific regions. This indicates that the AI needs to mention how cloud services are chosen based on their region for security purposes.

Public Key Infrastructure (PKI) is another key point here. It explains how PKI associates public keys with owners and uses digital certificates, which are signed digitally. I should make sure to include this as it's a fundamental aspect of secure communication.

Putting it all together, the user wants a concise summary in a "sn point" format, meaning each main idea is a separate sentence. So, I'll list these points clearly without any markdown or extra text.
</think>

Certainly! Here's a brief summary based on the context:

- Aim to compromise data confidentiality but note that passive attacks can occur undetected for extended periods.

- Data integrity during storage, processing, and retrieval is maintained to prevent unauthorized access.

- Cloud services like AWS EC2 are chosen based on their region for security purposes.

- Public Key Infrastructure (PKI) associates public keys with key owners and uses digital certificates signed digitally.


### Sources

- **Doc 1** (page 19): “• Aim to compromise the confidentiality of the data. • Due to passive nature of the attack, it can take place undetected for extended periods of time.…”

- **Doc 2** (page 13): “received. • Extends to how data is stored, processed, and retrieved.…”

- **Doc 3** (page 5): “• The choice of CSP services and geographical locations. ( e.g: AWS EC2 in specified  AWS region) • How these services integrate into their IT environment.…”

- **Doc 4** (page 32): “Public Key Infrastructure (PKI) • Used to associate public keys with their corresponding key owners • Rely on the use of digital certificates, which are digitally signed data…”
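
The answer above includes the model's `<think>…</think>` reasoning block ahead of the actual summary. A minimal sketch of separating the two before display follows, assuming the model wraps its reasoning in literal `<think>` tags as in this output. `split_reasoning` is a hypothetical helper, not part of LangChain or Ollama.

```python
import re

# Non-greedy match so several <think> blocks are handled independently.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Return (reasoning, answer) from a raw model response.

    Text inside <think>...</think> is collected as reasoning; everything
    else, stripped of surrounding whitespace, is the answer.
    """
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer


raw = "<think>\nList the four points.\n</think>\n\nCertainly! Here's a summary."
reasoning, answer = split_reasoning(raw)
print(answer)
```

In `answer_question`, this could be applied to `resp["result"]` before escaping, so that only `answer` is rendered under the "Answer" heading and the reasoning is shown separately or dropped.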