# Task
Create agents for reading documentation files and websites related to a particular topic.

## Install necessary libraries

### Subtask:
Install the libraries required for web scraping, document loading, and agent creation, such as `requests`, `BeautifulSoup`, `langchain`, and `langchain-experimental`.


**Reasoning**:
Install the necessary libraries for web scraping, document loading, and agent creation using pip.



In [None]:
%pip install requests beautifulsoup4 lxml unstructured langchain langchain-community langchain-experimental langchain-text-splitters pypdf sentence-transformers faiss-cpu
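
Before moving on, it can help to confirm that the modules imported by the later cells are importable in the current kernel. The following is a minimal sketch using only the standard library; the helper name `missing_modules` and the module list are assumptions based on the imports used in this notebook, not part of any library.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names (not pip package names) used by the cells in this notebook.
required = [
    "requests",
    "bs4",
    "lxml",
    "langchain_community",
    "langchain_text_splitters",
    "faiss",
]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, rerunning the install cell and restarting the kernel is usually the first thing to try.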



## Load data

### Subtask:
Load data from provided documentation files (e.g., PDF, text files) and scrape content from specified websites related to the topic.


**Reasoning**:
Define the paths to local documentation files and the URLs of the websites to scrape, then load the files with document loaders and scrape the text content of each website.



In [3]:
import requests
from bs4 import BeautifulSoup
from langchain_community.document_loaders import PyPDFLoader, TextLoader
import os

# 2. Define website URLs (note: these arXiv links serve PDF files, not HTML pages)
website_urls = [
  "https://arxiv.org/pdf/2010.07487v3.pdf",
  "https://arxiv.org/pdf/2008.02275v3.pdf",
  "https://arxiv.org/pdf/2401.13481",
  "https://arxiv.org/pdf/2505.07468",
]

# Data structure to store content
all_content = []

# 4. Scrape content from websites
for url in website_urls:
    try:
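        response = requests.get(url, timeout=30)  # Avoid hanging forever on a slow server
        response.raise_for_status()  # Raise an exception for bad status codes
        # BeautifulSoup expects HTML. If the server returns a PDF, get_text() only yields
        # the raw PDF bytes decoded as text (e.g. "%PDF-1.5"), not the paper's content.
        soup = BeautifulSoup(response.content, 'lxml')
        # Extract text content (you might need to refine this based on website structure)
        text_content = soup.get_text(separator='\n', strip=True)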
        all_content.append({"url": url, "content": text_content})
        print(f"Scraped content from {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
    except Exception as e:
        print(f"Error processing content from {url}: {e}")


# 5. Add a step for uploading PDF files here.
# Once uploaded, you can load them using PyPDFLoader like this:
# pdf_file_path = "path/to/your/uploaded/file.pdf" # Replace with the actual path
# loader = PyPDFLoader(pdf_file_path)
# pdf_documents = loader.load()
# all_content.extend(pdf_documents) # Add loaded PDF documents to all_content


# 6. Display a sample of the stored content
print("\n--- Sample of Loaded/Scraped Content ---")
if all_content:
    # Print content based on type (Document object or dictionary)
    for item in all_content[:2]: # Display first 2 items as sample
        if isinstance(item, dict):
            print(f"Source: {item['url']}\nContent snippet: {item['content'][:500]}...")
        else:
            print(f"Source: {item.metadata.get('source', 'Local File')}\nContent snippet: {item.page_content[:500]}...")
else:
    print("No content was loaded or scraped.")

Scraped content from https://arxiv.org/pdf/2010.07487v3.pdf
Scraped content from https://arxiv.org/pdf/2008.02275v3.pdf
Scraped content from https://arxiv.org/pdf/2401.13481
Scraped content from https://arxiv.org/pdf/2505.07468

--- Sample of Loaded/Scraped Content ---
Source: https://arxiv.org/pdf/2010.07487v3.pdf
Content snippet: %PDF-1.5
%...
Source: https://arxiv.org/pdf/2008.02275v3.pdf
Content snippet: %PDF-1.5
%...
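
The `%PDF-1.5` snippets above suggest that these URLs return raw PDF bytes, which an HTML parser cannot turn into readable text. A minimal sketch of one way to detect this before parsing is shown below. `detect_payload_kind` is a hypothetical helper (not part of `requests` or LangChain); it relies on the `%PDF-` signature that PDF files begin with, plus the `Content-Type` header.

```python
def detect_payload_kind(content_type: str, body: bytes) -> str:
    """Classify a response body as 'pdf', 'html', or 'other'."""
    media_type = content_type.split(";")[0].strip().lower()
    # PDF files begin with the signature "%PDF-"; trust the bytes over the header.
    if body.lstrip().startswith(b"%PDF-") or media_type == "application/pdf":
        return "pdf"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    return "other"


# Example with in-memory bodies (no network access needed):
print(detect_payload_kind("application/pdf", b"%PDF-1.5\n..."))  # pdf
print(detect_payload_kind("text/html; charset=utf-8", b"<html><body>Hi</body></html>"))  # html
```

In the scraping loop, a response classified as `"pdf"` could be written to a temporary file and loaded with `PyPDFLoader`, while only `"html"` responses would go through BeautifulSoup.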


## Process Data

### Subtask:
Split the loaded content into smaller chunks and generate embeddings for each chunk.

**Reasoning**:
Split the text into manageable chunks for the embedding model and then generate embeddings for each chunk.
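
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. This is only an illustration of the two parameters under the assumption of fixed-width character windows; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its chunk boundaries will differ. The function name `sliding_chunks` is made up for this sketch.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means that a sentence cut at one chunk boundary still appears intact in the neighbouring chunk, at the cost of storing some text twice.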

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # Sentence-transformers embedding model

# 1. Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# 'all_content' holds dicts with a 'content' key (scraped pages) and/or Document objects (loaded files)
texts = []
for item in all_content:
    if isinstance(item, dict):
        texts.append(item['content'])
    else:
        texts.append(item.page_content)

split_texts = text_splitter.create_documents(texts)


# 2. Generate embeddings for each chunk.
# HuggingFaceEmbeddings() with no arguments uses its default sentence-transformers model;
# pass model_name=... to choose a different one.
embeddings = HuggingFaceEmbeddings()
doc_embeddings = embeddings.embed_documents([t.page_content for t in split_texts])

print(f"Split {len(texts)} documents into {len(split_texts)} chunks.")
print("\n--- Sample of Split Text Chunks ---")
for i, chunk in enumerate(split_texts[:2]):
    print(f"Chunk {i+1}:\n{chunk.page_content[:500]}...\n")

# The next step is to store the chunks in a vector store.
# Note: `doc_embeddings` is computed here only to demonstrate the embedding step;
# the vector store re-embeds the chunks itself when given `embeddings`.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Split 4 documents into 9 chunks.

--- Sample of Split Text Chunks ---
Chunk 1:
%PDF-1.5
%...

Chunk 2:
%PDF-1.5
%...



## Generate Embeddings and Store in Vector Store

### Subtask:
Generate embeddings for the split text chunks and store them in a vector store.

**Reasoning**:
Generate embeddings for each text chunk using a chosen embedding model and then store these embeddings along with the text chunks in a vector store for efficient retrieval.
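
Conceptually, retrieval from a vector store amounts to comparing the query's embedding against every stored chunk embedding and returning the closest ones. The sketch below shows that idea with cosine similarity over toy 2-dimensional vectors. The vectors and the helper names are invented for illustration; real embeddings have hundreds of dimensions, and FAISS uses optimized index structures and may use a different distance metric than the one shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" for three chunks and one query.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], chunk_vecs, k=2))  # [0, 2]
```

A brute-force scan like this is fine for a handful of chunks; a dedicated index becomes worthwhile as the number of chunks grows.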

In [8]:
from langchain_community.vectorstores import FAISS

# Reuse `split_texts` and `embeddings` from the previous cell;
# FAISS.from_documents embeds each chunk and adds it to the index.
vectorstore = FAISS.from_documents(split_texts, embeddings)

print(f"Created a vector store with {len(split_texts)} chunks.")
print("\n--- Vector Store Info ---")

# You can perform similarity searches on the vectorstore
# Example:
query = "What is the main topic of the documents?"
docs = vectorstore.similarity_search(query)
print(f"Most similar documents to the query: {docs}")

Created a vector store with 9 chunks.

--- Vector Store Info ---
Most similar documents to the query: [Document(id='bd244d25-3ad7-4264-9757-ab60cd3b03d2', metadata={}, page_content="external\nEdition of the book in which the document was published\nvolume\nText\nexternal\nPublication volume number\nnumber\nText\nexternal\nPublication issue number within a volume\npageRange\nText\nexternal\nPage range for the document within the print version of its publication\nissn\nText\nexternal\nISSN for the printed publication in which the document was published\neIssn\nText\nexternal\nISSN for the electronic publication in which the document was published\nisbn\nText\nexternal\nISBN for the publication in which the document was published\ndoi\nText\nexternal\nDigital Object Identifier for the document\nurl\nURL\nexternal\nURL at which the document can be found\nbyteCount\nInteger\ninternal\nApproximate file size in octets\npageCount\nInteger\ninternal\nNumber of pages in the print version of the 