<a href="https://colab.research.google.com/github/alexfazio/firecrawl-quickstarts/blob/main/claude_researcher_with_map.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Firecrawl Web Crawling with Anthropic
This notebook demonstrates how to use the Firecrawl API together with Anthropic's Claude models to search a website for specific information. It takes a user-defined objective and a website URL, finds candidate pages, and extracts the information relevant to the objective.

### Requirements
1. **Firecrawl API key**: Obtain from your Firecrawl account.
2. **Anthropic API key**: Obtain from Anthropic; the notebook calls a Claude model in every step.
3. **AgentOps API key**: If using AgentOps, include its API key.

Set up your API keys as environment variables or directly in the notebook for ease of access.
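
As a minimal sketch of the environment-variable approach, the helper below reads a key from the environment and falls back to a hidden prompt when it is unset. The name `load_api_key` and its `prompt_fn` parameter are hypothetical and are not part of the notebook.

```python
import os
from getpass import getpass


def load_api_key(name, prompt_fn=getpass):
    """Return the key stored in environment variable `name`, or prompt for it.

    `load_api_key` is an illustrative helper, not part of Firecrawl or Anthropic.
    """
    value = os.environ.get(name)
    if not value:
        # Fall back to a hidden prompt so the key is never written into the notebook.
        value = prompt_fn(f"{name}: ")
    return value
```

With this in place, `load_api_key('FIRECRAWL_API_KEY')` could stand in for the direct `getpass(...)` calls used later.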


In [1]:
%pip install -q firecrawl-py anthropic agentops

In [28]:
from getpass import getpass
from firecrawl import FirecrawlApp
import os, re, json, anthropic, agentops



In [None]:
# Initialize the FirecrawlApp and the Anthropic client, and read the AgentOps key
app = FirecrawlApp(api_key=getpass('FIRECRAWL_API_KEY'))
client = anthropic.Anthropic(api_key=getpass('ANTHROPIC_API_KEY'))
AGENTOPS_API_KEY = getpass('AGENTOPS_API_KEY')

### ANSI Color Codes
For adding colored output in the notebook, we define a class for color codes.


In [22]:
class Colors:
    CYAN = '\033[96m'
    YELLOW = '\033[93m'
    GREEN = '\033[92m'
    RED = '\033[91m'
    MAGENTA = '\033[95m'
    BLUE = '\033[94m'
    RESET = '\033[0m'
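
The print calls in the rest of the notebook wrap each message in a color code and `RESET` by hand. A small wrapper, sketched here under the hypothetical name `colorize`, shows the same pattern in one place. Only two of the codes are repeated so the example stands alone.

```python
class Colors:
    CYAN = '\033[96m'
    RESET = '\033[0m'


def colorize(text, color):
    """Wrap `text` in an ANSI color code and reset the terminal afterwards."""
    return f"{color}{text}{Colors.RESET}"


print(colorize("Initiating search...", Colors.CYAN))
```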

### Step 1: Finding the Relevant Page
The function `find_relevant_page_via_map` takes an objective and a website URL. It then uses the Anthropic client to generate search parameters for the Firecrawl API to map the website and identify relevant pages based on the objective.


In [23]:
def find_relevant_page_via_map(objective, url, app, client):
    try:
        print(f"{Colors.CYAN}Objective: {objective}{Colors.RESET}")
        print(f"{Colors.CYAN}Initiating search on the website: {url}{Colors.RESET}")

        map_prompt = f"""
        The map function generates a list of URLs from a website and accepts a search parameter.
        Based on the objective of: {objective}, suggest a 1-2 word search parameter.
        """

        completion = client.messages.create(
            model='claude-3-5-sonnet-20241022',
            max_tokens=1000,
            temperature=0,
            system="Expert web crawler",
            messages=[{'role': 'user', 'content': map_prompt}]
        )

        map_search_parameter = completion.content[0].text
        map_website = app.map_url(url, params={'search': map_search_parameter})

        print(f"{Colors.GREEN}Mapping completed. Links found: {len(map_website['links'])}{Colors.RESET}")
        return map_website['links']
    except Exception as e:
        print(f"{Colors.RED}Error: {str(e)}{Colors.RESET}")
        return None
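
The prompt asks for a 1-2 word search parameter, but the model's reply is passed to `map_url` unchanged, so any quotes or trailing explanation would become part of the search. A hedged sketch of a normalizing step follows. `clean_search_parameter` is a hypothetical helper, and its heuristic (first non-empty line, first two words) will not handle a reply that opens with a preface such as "Search parameter:".

```python
import re


def clean_search_parameter(raw, max_words=2):
    """Reduce a model reply to at most `max_words` lowercase words.

    Illustrative heuristic only: uses the first non-empty line of the reply.
    """
    first_line = next((line for line in raw.splitlines() if line.strip()), "")
    # Keep alphanumeric words (hyphens allowed inside), dropping quotes and punctuation.
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", first_line)
    return " ".join(words[:max_words]).lower()
```

It could be applied to `completion.content[0].text` before the value is passed as the `search` parameter.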

### Step 2: Examining Top Pages using Firecrawl's [Map](https://docs.firecrawl.dev/features/map)
The function `find_objective_in_top_pages` examines the top pages from the website map, attempting to fulfill the user's objective using scraped content. If the objective is met, it returns the relevant data in JSON format.

**Note:** Firecrawl's Map response is an ordered list, from most relevant to least relevant. By taking only the first three elements (`[:3]`), the function analyzes just the three most relevant pages identified during the mapping stage.


In [24]:
def find_objective_in_top_pages(map_website, objective, app, client):
    try:
        # Get the top 3 links from the map result
        top_links = map_website[:3]
        print(f"{Colors.CYAN}Analyzing the {len(top_links)} top links: {top_links}{Colors.RESET}")

        batch_scrape_result = app.batch_scrape_urls(top_links, {'formats': ['markdown']})
        print(f"{Colors.GREEN}Batch scraping completed.{Colors.RESET}")

        for scrape_result in batch_scrape_result['data']:
            check_prompt = f"""
            Given scraped content and objective, determine if the objective is met.
            Extract relevant information in simple JSON if met.
            Objective: {objective}
            Scraped content: {scrape_result['markdown']}
            """

            completion = client.messages.create(
                model='claude-3-5-sonnet-20241022',
                max_tokens=1000,
                temperature=0,
                system="Expert web crawler",
                messages=[{'role': 'user', 'content': check_prompt}]
            )

            result = completion.content[0].text
            if result and result != 'Objective not met':
                try:
                    return json.loads(result)
                except json.JSONDecodeError as e:
                    print(f"{Colors.RED}JSON parsing error: {e}. Raw result: {result}{Colors.RESET}")
                    continue  # Skip to the next result if parsing fails

        print(f"{Colors.RED}Objective not met in examined content.{Colors.RESET}")
        return None
    except Exception as e:
        print(f"{Colors.RED}Error during analysis: {str(e)}{Colors.RESET}")
        return None
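
The function above passes the whole reply to `json.loads`, which succeeds only when the reply is bare JSON. The sketch below, using made-up replies and a hypothetical `parse_strict` helper, shows how a single sentence of preamble makes strict parsing fail.

```python
import json


def parse_strict(reply):
    """Parse `reply` as JSON, returning None if it is not bare JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None


# Made-up replies for illustration; real model output will differ.
bare_reply = '{"plan": "Hobby"}'
wrapped_reply = 'Here is the extracted data:\n{"plan": "Hobby"}'

print(parse_strict(bare_reply))
print(parse_strict(wrapped_reply))
```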

### Step 3: Find and Extract Information

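This revised version of `find_objective_in_top_pages` finds and extracts information related to a given `objective` from the top-ranked pages of a website. Unlike the Step 2 version, it tolerates replies in which the JSON is surrounded by other text.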

**Functionality:**

1. **Selects Top Links:** It selects the top two URLs from the `map_website` list, assuming they are the most relevant to the objective.
2. **Scrapes Content:** It uses the `app.batch_scrape_urls` function to scrape content from these selected URLs in Markdown format.
3. **Analyzes Content:**  For each scraped page, it constructs a prompt for the Anthropic Claude model. This prompt asks the model to determine if the scraped content fulfills the `objective`. If it does, the model is asked to extract the relevant information and format it as JSON.
4. **Extracts JSON:** The function uses a regular expression to identify JSON-like blocks within the Anthropic model's response and parses the first match with `json.loads`.

In [25]:
def find_objective_in_top_pages(map_website, objective, app, client):
    try:
        top_links = map_website[:2]
        print(f"{Colors.CYAN}Analyzing top links: {top_links}{Colors.RESET}")

        batch_scrape_result = app.batch_scrape_urls(top_links, {'formats': ['markdown']})
        print(f"{Colors.GREEN}Batch scraping completed.{Colors.RESET}")

        # Regex pattern to match JSON-like blocks in the response
        json_pattern = r"\{(?:[^{}]|(?:\{[^{}]*\}))*\}"

        for scrape_result in batch_scrape_result['data']:
            check_prompt = f"""
            Given scraped content and objective, determine if the objective is met.
            Extract relevant information in simple JSON if met.
            Objective: {objective}
            Scraped content: {scrape_result['markdown']}
            """

            completion = client.messages.create(
                model='claude-3-5-sonnet-20241022',
                max_tokens=1000,
                temperature=0,
                system="Expert web crawler",
                messages=[{'role': 'user', 'content': check_prompt}]
            )

            result = completion.content[0].text
            # Search for JSON-like block in the result text
            json_match = re.search(json_pattern, result, re.DOTALL)
            if json_match:
                try:
                    return json.loads(json_match.group(0))
                except json.JSONDecodeError as e:
                    print(f"{Colors.RED}JSON parsing error: {e}. Raw result: {json_match.group(0)}{Colors.RESET}")
                    continue  # Skip to the next result if parsing fails
            else:
                print(f"{Colors.YELLOW}No JSON found in the response. Raw result: {result}{Colors.RESET}")

        print(f"{Colors.RED}Objective not met in examined content.{Colors.RESET}")
        return None
    except Exception as e:
        print(f"{Colors.RED}Error during analysis: {str(e)}{Colors.RESET}")
        return None
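
To see what the regular expression above does and does not match, the sketch below applies the same pattern to a few made-up replies. The wrapper name `extract_first_json` is hypothetical.

```python
import json
import re

# Same pattern as in the function above: a brace pair that may contain
# one further level of nested braces.
JSON_PATTERN = r"\{(?:[^{}]|(?:\{[^{}]*\}))*\}"


def extract_first_json(text):
    """Return the first JSON-like block in `text` as a Python object, or None."""
    match = re.search(JSON_PATTERN, text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


reply = 'Result:\n{"plans": {"free": 0}, "found": true}\nDone.'
print(extract_first_json(reply))
```

Because the pattern allows only one level of nesting, an object nested more deeply is not matched as a whole. For `{"a": {"b": {"c": 1}}}` the search skips the outermost brace and returns the inner object `{"b": {"c": 1}}`.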

### Step 4: Executing the Main Function
The main function prompts for user input (website URL and objective), calls the `find_relevant_page_via_map` and `find_objective_in_top_pages` functions, and displays results accordingly.


In [26]:
def main():
    url = input(f"{Colors.BLUE}Enter website URL:{Colors.RESET}") or "https://www.firecrawl.dev/"
    objective = input(f"{Colors.BLUE}Enter objective:{Colors.RESET}") or "find pricing plans"

    map_website = find_relevant_page_via_map(objective, url, app, client)

    if map_website:
        result = find_objective_in_top_pages(map_website, objective, app, client)
        if result:
            print(f"{Colors.GREEN}Objective met. Extracted info:{Colors.RESET}")
            print(f"{Colors.MAGENTA}{json.dumps(result, indent=2)}{Colors.RESET}")
        else:
            print(f"{Colors.RED}Objective not fulfilled with available content.{Colors.RESET}")
    else:
        print(f"{Colors.RED}No relevant pages identified.{Colors.RESET}")

In [29]:
main()

Enter website URL:https://www.firecrawl.dev/
Enter objective:yes or no: is firecrawl backed by y combinator?
Objective: yes or no: is firecrawl backed by y combinator?
Initiating search on the website: https://www.firecrawl.dev/
Mapping completed. Links found: 42
Analyzing top links: ['https://www.firecrawl.dev/blog/your-ip-has-been-temporarily-blocked-or-banned', 'https://www.firecrawl.dev/blog/how-to-quickly-install-beautifulsoup-with-python']
Batch scraping completed.
Objective met. Extracted info:
{
  "can_determine": false,
  "reason": "No mention of Y Combinator backing in the provided content"
}
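
In the run above, `main()` prints "Objective met" even though the extracted JSON says the answer could not be determined, because any non-empty parsed result is treated as success. One possible guard is sketched below. The helper name `looks_unfulfilled` is hypothetical, and the keys it checks are assumptions: `can_determine` appears only in this sample output, `objective_met` and `found` are invented for illustration, and the model's reply has no fixed schema.

```python
def looks_unfulfilled(result):
    """Heuristic: True if the parsed reply signals that the objective was not met."""
    if not isinstance(result, dict) or not result:
        return True
    # Assumed keys; adjust to whatever schema the prompt actually requests.
    negative_flags = ("can_determine", "objective_met", "found")
    return any(result.get(key) is False for key in negative_flags)


sample = {
    "can_determine": False,
    "reason": "No mention of Y Combinator backing in the provided content",
}
print(looks_unfulfilled(sample))
```

A more robust option would be to have the prompt request an explicit success field and to check only that field.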
