## Developing LLM Based Retrieval-Augmented Generation (RAG) 

This notebook demonstrates how to combine document processing, web data retrieval, semantic search, and language model generation to create an effective question-answering system.

📄 PDF Extraction
Local PDF files are loaded and their textual content is extracted using the PyMuPDF library. The extracted text is then segmented into smaller, manageable chunks to facilitate semantic processing by embedding models and language models.

🌐 Google Search
Based on the user’s query, the system performs a free Google search to retrieve the most relevant web pages. The top paragraphs from each page are scraped using requests and BeautifulSoup to supplement the PDF content with up-to-date, real-world information.

🧠 Embedding
All text chunks—whether from PDFs or web pages—are converted into vector embeddings using a pre-trained Sentence Transformer (all-MiniLM-L6-v2). These embeddings capture the semantic meaning of each chunk, enabling effective similarity comparisons.

🧠 Vector DB (ChromaDB)
ChromaDB is used to store the generated embeddings alongside their corresponding text chunks. During query time, the system retrieves the most relevant content by performing a similarity search against the user’s question embedding.

💬 Language Model
A language model (accessed via GroqCloud) generates a contextualized answer by incorporating both the user’s question and the most relevant chunks retrieved from ChromaDB. This step leverages the model’s capacity to reason and respond in natural language.

🖥️ Main Interface
All components are orchestrated into a streamlined question-answering pipeline. The interface handles PDF processing, online search, embedding generation, vector retrieval, and final answer generation—providing an interactive and user-friendly experience.

In [30]:
import os
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup
from googlesearch import search
import chromadb
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from groq import Groq

# === PDF Text Extraction ===
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    return "\n".join(page.get_text() for page in doc)

def extract_all_pdfs_text(folder_path="pdf"):
    return "\n".join(
        extract_text_from_pdf(os.path.join(folder_path, f))
        for f in os.listdir(folder_path)
        if f.lower().endswith(".pdf")
    )

def chunk_text(text, max_length=500):
    sentences = text.split('. ')
    chunks, chunk = [], ""
    for sentence in sentences:
        if len(chunk) + len(sentence) < max_length:
            chunk += sentence + ". "
        else:
            chunks.append(chunk.strip())
            chunk = sentence + ". "
    if chunk:
        chunks.append(chunk.strip())
    return chunks

# === Google Search and Webpage Extraction ===
def free_google_search(query, num_results=5):
    return list(search(query, num_results=num_results))

def fetch_text_from_url(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        res = requests.get(url, headers=headers, timeout=5)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, 'html.parser')
        return " ".join([p.get_text() for p in soup.find_all('p')[:5]])
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return ""

# === ChromaDB Setup ===
embedder = SentenceTransformer('all-MiniLM-L6-v2')
client_db = chromadb.Client()
collection_name = "pdf_google_free_search"

try:
    collection = client_db.get_collection(name=collection_name)
except:
    collection = client_db.create_collection(name=collection_name)

def add_to_chromadb(text_chunks):
    embeddings = embedder.encode(text_chunks).tolist()
    for i, (chunk, embedding) in enumerate(zip(text_chunks, embeddings)):
        collection.add(
            ids=[f"chunk_{i}"],
            documents=[chunk],
            embeddings=[embedding]
        )

def retrieve_similar_chunks(query, top_k=5):
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k
    )
    return results['documents'][0]

# === Groq LLaMA-3 API ===
groq_API_key = "groq_API_key"  
groq_client = Groq(api_key=groq_API_key)

def generate_answer_groq_llama(query, context_chunks):
    context = "\n".join(context_chunks)
    prompt = f"Soru: {query}\nBilgi: {context}\nCevap:"

    completion = groq_client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}],
        temperature=1,
        max_completion_tokens=1024,
        top_p=1,
        stream=True,
        stop=None,
    )

    print("\nAnswer:")
    for chunk in completion:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print("\n")

# === Main Pipeline ===
def main():
    pdf_folder = "pdf"
    pdf_text = extract_all_pdfs_text(pdf_folder)
    pdf_chunks = chunk_text(pdf_text)

    user_query = input("Question: ")

    urls = free_google_search(user_query)
    google_texts = [fetch_text_from_url(url) for url in urls if url]

    all_chunks = pdf_chunks + google_texts
    all_chunks = [chunk for chunk in all_chunks if len(chunk.split()) > 20]

    existing = collection.get()
    if "ids" in existing and existing["ids"]:
        collection.delete(ids=existing["ids"])

    add_to_chromadb(all_chunks)
    retrieved_chunks = retrieve_similar_chunks(user_query, top_k=5)

    generate_answer_groq_llama(user_query, retrieved_chunks)

if __name__ == "__main__":
    main()

# Question 1    
# Who is Searle?

Question: Who is Searle?

Answer:
John Searle is an American philosopher best known for his work in the philosophy of language, especially speech act theory, and the philosophy of mind.



In [26]:
main() 
# Question 2
# Explain the argument called "Chinese Room Argument". 
# In your explanation, outline the premises of the argument.


Question: explain the argument called "Chinese Room Argument". In your explanation, outline the premises of the argument.

Answer:
The Chinese Room Argument" is a thought experiment introduced by philosopher John Searle in 1980, which challenges the idea that computers or artificial intelligence (AI) can truly think or understand language. Here are the premises:

**Premise 1:** Imagine a person who does not speak Chinese is locked in a room with a large set of rules and Chinese characters. This person receives Chinese input through a slot in the door and responds with Chinese output, also through the slot.

**Premise 2:** The person inside the room does not understand Chinese, but is able to produce Chinese output by following the rules provided.

**Premise 3:** The output produced by the person in the room is indistinguishable from the output produced by a native Chinese speaker.

**Premise 4:** Despite the person in the room producing Chinese output that is indistinguishable from a n

In [27]:
main()
# Question 3
# Write a list of Authors that counter attacks or objects Searle's Chinese Room Argument.


Question: Write a list of Authors that counter attacks or objects Searle's Chinese Room Argument.

Answer:
Here is a list of authors who have countered or objected to Searle's Chinese Room Argument:

1. Anatoly Mickevich (pseudonym A. I. Mikhailov) - proposed a thought experiment involving myriad humans acting as a computer in 1961.
3. Schank - developed AI programs that Searle's argument was originally responding to.
4. Ned Block - argued that Searle's argument relies on an unrealistic simplification of the Chinese Room scenario.
5. Hilary Putnam - argued that the Chinese Room thought experiment is too narrow and ignores the role of context and embodied cognition.
6. Daniel Dennett - argued that Searle's argument relies on a mistaken understanding of artificial intelligence and the nature of computation.
7. David Chalmers - argued that Searle's argument is based on an unrealistic assumption that the Chinese Room system lacks any kind of internal mental states.
8. Stevan Harnad - argue

In [28]:
main()
# Question 4
# What is the most promising argument against Searl's Chinese Room Argument? 
# Compare two arguments premise by premise.


Question: What is the most promising argument against Searl's Chinese Room Argument? Compare two arguments premise by premise.
Error fetching /search?num=7: Invalid URL '/search?num=7': No scheme supplied. Perhaps you meant https:///search?num=7?

Answer:
One of the most promising arguments against Searle's Chinese Room Argument is the "Systems Reply" or "Systems Response". Here's a comparison of the two arguments premise by premise:

**Searle's Chinese Room Argument**

1. Premise: A person who understands no Chinese is locked in a room with a set of rules and Chinese characters.
3. Premise: The person processes the Chinese characters according to the rules, producing Chinese sentences that are indistinguishable from those written by a native Chinese speaker.
4. Conclusion: The person in the room does not understand Chinese, despite producing sentences that are indistinguishable from those written by a native speaker.

**The Systems Reply**

1. Premise: The system (person + rules + Chi