### Second part of the assignment

## Web Scraper + RAG Agent Pipeline

This agent:
- uses a web search agent (DuckDuckGo search) to find 5 relevant site URLs for a given query
- scrapes the main textual content from these URLs (using `beautifulsoup4`)
- applies a RAG pipeline to the scraped data in the context of the given query
  - treats each scraped page (truncated to 3000 characters) as one chunk, embeds it, and retrieves the most relevant context for the LLM (Groq)
  - gives a final answer based on this RAG pipeline

In [106]:
from duckduckgo_search import DDGS
import requests
from bs4 import BeautifulSoup
from langchain_groq import ChatGroq
from langchain_cohere import CohereEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document
from dotenv import load_dotenv
import os
import re

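# Load API keys (GROQ_API_KEY, COHERE_API_KEY) from a local .env file
load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
llm = ChatGroq(
    #model_name="llama-3.3-70b-versatile",
    model_name="gemma2-9b-it",
    temperature=0.7
)

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore', category=Warning)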

### WebSearch Agent to find top 5 URLs

In [107]:
def ddg_url_search(query: str, max_results: int = 5) -> list[str]:
    """Return up to max_results http(s) URLs from a DuckDuckGo text search."""
    try:
        with DDGS() as ddgs:
            results = ddgs.text(query, region='wt-wt', safesearch="off", max_results=max_results)
            urls = []
            for r in results:
                url = r.get("href", "")
                # skip results without a usable http(s) link
                if url and url.startswith("http"):
                    urls.append(url)
            return urls
    except Exception as e:
        print(f"Error during DDG search: {e}")
        return []
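
Search results sometimes contain several pages from the same site, which narrows the context the RAG step can draw on. A minimal sketch of one way to post-process the URL list, assuming a hypothetical helper `dedupe_by_host` (not part of the original notebook) that keeps only the first URL per host:

```python
from urllib.parse import urlparse


def dedupe_by_host(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each host, preserving search order.

    Hypothetical helper for illustration; the original pipeline does not do this.
    """
    seen_hosts = set()
    unique = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # strip a leading "www." so www.example.com and example.com match
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            unique.append(url)
    return unique
```

It could be applied as `urls = dedupe_by_host(ddg_url_search(query, max_results=10))[:5]`, requesting more results than needed so that five distinct sites are more likely to remain.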
                

### WebScraper to extract main textual content from top 5 URLs

(Using requests + BeautifulSoup to get readable text from each URL.)

In [108]:
def scrape_from_url(url: str) -> str:
    try:
        resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()  # treat HTTP errors (404, 500, ...) as scrape failures
        soup = BeautifulSoup(resp.text, 'html.parser')

        # remove non-content elements
        for tag in soup(['script', 'style', 'head', 'meta', 'noscript']):
            tag.decompose()
        # get all the text
        text = soup.get_text(separator='\n')
        # collapse runs of whitespace into single spaces so words stay separated
        text = re.sub(r'\s+', ' ', text).strip()

        # truncate long outputs (this may drop relevant context; capped at 3000 characters for now)
        if len(text) > 3000:
            text = text[:3000]
            print("Scraping Complete! Text is truncated to 3000 characters.")

        return text
    except Exception as e:
        print(f"Error scraping {url}: {str(e)}")
        return ""
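
How whitespace is collapsed and where the text is cut both affect what the embedding model sees. The following stdlib-only sketch shows one option: collapse whitespace to single spaces, then truncate on a word boundary instead of mid-word. The names `clean_text` and `truncate_at_word` are hypothetical and are not used elsewhere in this notebook.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", raw).strip()


def truncate_at_word(text: str, limit: int = 3000) -> str:
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # if the character right after the cut is not whitespace, the cut
    # landed inside a word, so back up to the previous space
    if not text[limit].isspace() and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()


raw = "Latest\n\n  AI   research\tnews"
print(clean_text(raw))
print(truncate_at_word(clean_text(raw), limit=12))
```

Replacing whitespace with an empty string instead of a space would glue neighbouring words together, which hurts both retrieval quality and the readability of the generated answer.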

### Creating the RAG document store
(using `cohere` for vector embeddings, and `FAISS` as the vectorstore)

In [109]:
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

def create_rag_store(scraped_texts, urls):
    """Embed the scraped pages with Cohere and index them in a FAISS vector store."""
    docs = []
    for txt, url in zip(scraped_texts, urls):
        # failed scrapes come back as empty strings, so skip them
        if not txt.startswith("ERROR") and len(txt.strip()) > 0:
            docs.append(Document(page_content=txt, metadata={"source": url}))
    if not docs:
        raise ValueError("No valid documents to index! Please check the scraping step and input URLs.")
    embeddings = CohereEmbeddings(model="embed-english-v3.0")
    vectorstore = FAISS.from_documents(docs, embeddings)
    return vectorstore
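
The store above indexes each scraped page as a single document. Splitting pages into smaller overlapping chunks before embedding usually gives the retriever finer-grained context. LangChain ships text splitters for this; the stdlib-only sketch below only illustrates the idea. The function `chunk_text` and its default sizes are assumptions, not part of the original notebook.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk could then be wrapped in its own `Document` carrying the same `source` URL in its metadata, so the answer can still be traced back to the page it came from.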

### RAG Retrieval and QA Chain

In [110]:
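def rag_answer(query: str, vectorstore):
    # retrieve the 5 most similar documents and "stuff" them into a single prompt
    retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    # returns a dict containing 'result' and 'source_documents'
    return qa_chain.invoke({"query": query})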
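
Conceptually, the retriever embeds the query and returns the `k` stored documents whose vectors are closest to it. The toy sketch below ranks hand-written vectors by cosine similarity to show the idea. The helpers `cosine` and `top_k` are hypothetical, and the distance metric the FAISS index actually uses may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# toy 2-dimensional "embeddings"
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```

Note that this notebook indexes at most five documents (one per URL), so with `k=5` the retriever returns every indexed page. Ranking only starts to matter once pages are split into more chunks than `k`.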


### Full agent:

In [111]:
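def agent(query):
    urls = ddg_url_search(query)
    if not urls:
        return "No relevant URLs found."

    scraped_texts = [scrape_from_url(url) for url in urls]
    try:
        vectorstore = create_rag_store(scraped_texts, urls)
    except ValueError:
        # the store builder raises ValueError when every scrape came back empty
        return "No valid documents found to create a vector store."

    answer = rag_answer(query, vectorstore)
    if not answer:
        return "No answer found."

    # list each retrieved source URL under the answer
    response = f"Answer: {answer['result']}\n\nSources:\n"
    for doc in answer['source_documents']:
        response += f"- {doc.metadata['source']}\n"
    return response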

## Testing

In [112]:
question = "What is the latest AI research from within 3 months ago?"

print("============Step 1: Searching for 5 relevant URLs...========\n")
urls = ddg_url_search(question)
print("Found URLs:", urls)

print("============Step 2: Scraping sites...==========\n")
scraped = [scrape_from_url(u) for u in urls]

print("============Step 3: Building RAG vector store...========\n")
vector_db = create_rag_store(scraped, urls)

print("========Step 4: Asking final question to RAG + LLM agent...==========\n")
result = rag_answer(question, vector_db)

print("\nGenerated Answer:\n", result['result'])
print("\nSource Documents:")
for doc in result['source_documents']:
    print(f"- Source: {doc.metadata['source']}")







Found URLs: ['https://indianexpress.com/section/technology/artificial-intelligence/', 'https://news.google.com/topics/CAAqJAgKIh5DQkFTRUFvSEwyMHZNRzFyZWhJRlpXNHRSMElvQUFQAQ', 'https://aimagazine.com/articles/this-weeks-top-five-stories-in-ai', 'https://datahubanalytics.com/top-10-ai-developments-in-the-past-week-week-1-march-2025/', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/']

Scraping Complete! Text is truncated to 3000 characters.
Scraping Complete! Text is truncated to 3000 characters.
Scraping Complete! Text is truncated to 3000 characters.
Scraping Complete! Text is truncated to 3000 characters.
Scraping Complete! Text is truncated to 3000 characters.



/opt/homebrew/Caskroom/miniconda/base/envs/newvenv/lib/python3.13/site-packages/cohere/core/pydantic_utilities.py:240: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
  return cast(Mapping[str, PydanticField], model.model_fields)  # type: ignore[attr-defined]
/opt/homebrew/Caskroom/miniconda/base/envs/newvenv/lib/python3.13/site-packages/cohere/utils.py:233: PydanticDeprecatedSince20: The `parse_obj` method is deprecated; use `model_validate` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  embeddings_by_type_merged = EmbedByTypeResponseEmbeddings.parse_obj(merged_dicts)








Generated Answer:
 Here are some AI research highlights from the past 3 months, based on the provided text: 

* **Empathy in AI:** Experts like Geoffrey Hinton (considered the "Godfather of AI") and Yann LeCun (Meta's AI chief) emphasize the importance of empathy in AI development to prevent negative consequences.
* **AI-Powered Crime Prediction:** The UK government is developing an AI-powered system to predict criminal activity like theft, knife attacks, and violent crimes.
* **Fake Job Applicant Profiles:** Gartner predicts that by 2028, 25% of job applicant profiles will be AI-generated.
* **AI for Job Seekers:** There's growing concern about AI tools being used to create fake job applicant profiles and potentially deceive employers.
* **Small AI Models for Mobile Devices:** Google released Gemma3270M, a compact AI model designed to run efficiently on mobile phones and edge devices. 
* **OpenAI and India:** OpenAI CEO Sam Altman talked about the immense potential of AI in India, hi