# **Web Scraper and Data Processor with LLM**

This notebook demonstrates how to:
1. **Scrape web content** from specified URLs.
2. **Extract and clean** the body content from the HTML.
3. **Split the content** into manageable chunks.
4. **Use a Language Model (LLM)** to parse and extract specific information based on a given prompt.

In [None]:
!pip install requests beautifulsoup4 langchain_ollama langchain_core httpx pandas

## 1. Importing Libraries

We will start by importing the necessary libraries for web scraping, data processing, and interacting with the LLM.

In [1]:
# Importing the necessary libraries
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing the website's HTML content
from langchain_ollama import OllamaLLM  # Importing the LLM interface
from langchain_core.prompts import ChatPromptTemplate  # For templating the prompts

## 2. Define the LLM Integration and Processing Functions

Template and LLM Initialization

Define the prompt template and initialize the LLM model.

In [21]:
# Define the prompt template for extracting specific information from web content
prompt_template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully: \n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
    "3. **Empty Response:** If no information matches the description, return an empty string ('')."
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)

# Initialize the LLaMA3 model (Ensure that the server is running on localhost and port is correct)
llama_model = OllamaLLM(model="llama3.1", host="localhost", port=11434)


## 3. Define the SmartScraper Class
This class will handle the scraping, content extraction, cleaning, and LLM processing.

In [18]:
# Define the SmartScraper class that integrates web scraping and LLM-based parsing
class SmartScraper:
    def __init__(self, url, llm_model, prompt):
        """
        Initialize the SmartScraper object with the URL, LLM model, and prompt.
        :param url: URL of the website to scrape.
        :param llm_model: Initialized LLaMA model object.
        :param prompt: Prompt template for LLaMA model.
        """
        self.url = url
        self.llm_model = llm_model
        self.prompt_template = prompt

    def scrape_website(self):
        """
        Fetch the website content from the provided URL.
        :return: HTML content of the webpage or empty string if request fails.
        """
        try:
            print("Connecting to the website...")
            response = requests.get(self.url)
            response.raise_for_status()  # Check for HTTP errors
            print("Successfully connected and retrieved content.")
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"An error occurred while fetching the URL: {e}")
            return ""

    def extract_body_content(self, html_content):
        """
        Extract the main body content from the HTML of the webpage.
        :param html_content: Full HTML content of the webpage.
        :return: Extracted body content as a string.
        """
        soup = BeautifulSoup(html_content, "html.parser")
        body_content = soup.body
        if body_content:
            return str(body_content)
        return ""

    def clean_body_content(self, body_content):
        """
        Clean the body content by removing scripts, styles, and unnecessary whitespace.
        :param body_content: HTML content of the body section.
        :return: Cleaned plain text content.
        """
        soup = BeautifulSoup(body_content, "html.parser")

        # Remove script and style elements
        for script_or_style in soup(["script", "style"]):
            script_or_style.extract()

        # Get cleaned text, stripping excess whitespace and newlines
        cleaned_content = soup.get_text(separator="\n")
        cleaned_content = "\n".join(line.strip() for line in cleaned_content.splitlines() if line.strip())

        return cleaned_content

    def split_content_into_chunks(self, content, max_length=6000):
        """
        Split the content into smaller chunks to ensure it fits within model input limits.
        :param content: The full cleaned content of the webpage.
        :param max_length: Maximum allowed length per chunk.
        :return: List of content chunks.
        """
        return [content[i:i + max_length] for i in range(0, len(content), max_length)]

    def parse_with_llama(self, content_chunks, parse_description):
        """
        Use the LLaMA model to parse and extract specific information from the content.
        :param content_chunks: List of content chunks.
        :param parse_description: The description of the information to extract.
        :return: Parsed result as a string.
        """
        # Initialize the prompt template
        prompt = ChatPromptTemplate.from_template(self.prompt_template)

        # Combine prompt template with LLM model
        chain = prompt | self.llm_model

        # Collect parsed results
        parsed_results = []

        for i, chunk in enumerate(content_chunks, start=1):
            response = chain.invoke({"dom_content": chunk, "parse_description": parse_description})
            print(f"Processing chunk {i} of {len(content_chunks)}")
            parsed_results.append(response)

        return "\n".join(parsed_results)

    def run(self, parse_description):
        """
        Run the full scraping and parsing workflow.
        :param parse_description: The description of the data to extract.
        :return: Parsed result or error message.
        """
        # Step 1: Scrape the website content
        dom_content = self.scrape_website()
        if not dom_content:
            return "Failed to retrieve content from the website."

        # Step 2: Extract and clean body content
        body_content = self.extract_body_content(dom_content)
        cleaned_content = self.clean_body_content(body_content)

        # Step 3: Split content into manageable chunks
        content_chunks = self.split_content_into_chunks(cleaned_content)

        # Step 4: Parse the content using LLaMA and the provided prompt
        parsed_result = self.parse_with_llama(content_chunks, parse_description)
        return parsed_result


## 4. Example Usage
Define Sample URLs and Prompts

Here we provide URLs and prompts to demonstrate the functionality of SmartScraper.

In [19]:

# Define sample URLs and prompts
urls = [
    "https://argentwork.com/",
    "https://www.reservebar.com/"
]

prompts = [
    "Extract all inventory related details in table format.",
    "Extract all inventory related details in table format"
]



## 5. Create and Run the Scraper Instances

Instantiate SmartScraper for each URL and prompt, then execute the scraping and parsing process.

In [22]:
# Example execution for each URL and prompt
for url, prompt in zip(urls, prompts):
    scraper = SmartScraper(url, llama_model, prompt)
    result = scraper.run(prompt)
    print(f"Results for {url}:")
    print(result)
    print("\n" + "="*50 + "\n")

Connecting to the website...
Successfully connected and retrieved content.
Processing chunk 1 of 2
Processing chunk 2 of 2
Results for https://argentwork.com/:
Here are the extracted inventory-related details in a table format:

**Inventory Details**

| **Category** | **Description** |
| --- | --- |
| **Item Name** | Product/Service description |
| **Quantity** | Number of units available |
| **Unit Price** | Cost per unit |
| **Total Value** | Total cost of all items in stock |
| **Low Stock Threshold** | Alert level to reorder items |
| **Reorder Point** | Quantity at which to automatically restock |
| **Supplier Information** | Contact details and terms for suppliers |

Let me know if you'd like me to add any other inventory-related columns!
Here are the extracted inventory-related details in a table format:

**Inventory Details**

| **Category** | **Description** |
| --- | --- |
| **Current Inventory Level** | 500 units (as of today) |
| **Product Variants** | Standard, Premium, an

## **Conclusion**
This notebook demonstrates a complete workflow for scraping web content, cleaning and processing it, and using an LLM to extract specific information. Customize the URLs and prompts as needed for your use case.


### Notes:
- **Port Number**: Ensure the `port` specified in `OllamaLLM` is the correct one where the Ollama server is running.
- **Server Connection**: Make sure the Ollama server is up and running before executing the notebook.
- **Error Handling**: The `ConnectError` indicates that the connection to the Ollama server is failing; double-check the server address and port.

