<a href="https://colab.research.google.com/github/alexfazio/firecrawl-quickstart/blob/main/web_crawler_grok_firecrawl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Web Crawler with Grok-2 and Firecrawl

By Alex Fazio (https://twitter.com/alxfazio)

GitHub repo: https://github.com/alexfazio/firecrawl-cookbook

This Jupyter notebook demonstrates how to combine Grok-2's language model capabilities with Firecrawl's web scraping features to build an intelligent web crawler that can extract structured data from websites.

By the end of this notebook, you'll be able to:

1. Set up the Grok-2 and Firecrawl environment
2. Build a targeted web crawler that understands content
3. Extract and process structured data from websites
4. Output the extracted data as JSON

This cookbook is designed for developers and data scientists who want to build advanced web crawlers with AI-powered content understanding.

## Setup

First, let's install the required packages:

In [None]:
%pip install firecrawl-py requests python-dotenv --quiet

In [None]:
import os
import json
import requests
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

## Initialize Environment

Enter your API keys securely:

In [None]:
from getpass import getpass

# Securely get API keys
grok_api_key = getpass("Enter your Grok-2 API key: ")
firecrawl_api_key = getpass("Enter your Firecrawl API key: ")

# Initialize FirecrawlApp
app = FirecrawlApp(api_key=firecrawl_api_key)
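
If you rerun the notebook often, typing both keys each time gets tedious. A minimal sketch of one alternative is to check the environment first and prompt only when a key is missing. The helper name `resolve_api_key` and the variable names `XAI_API_KEY` and `FIRECRAWL_API_KEY` are assumptions made for illustration, not something this notebook defines.

```python
import os
from getpass import getpass

try:
    # python-dotenv is installed in the setup cell; load_dotenv() reads a .env file if one is found
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # python-dotenv not available: rely on the process environment only


def resolve_api_key(env_var, prompt, env=os.environ, ask=getpass):
    """Return the key from the environment if set, otherwise prompt for it.

    `env_var` is a hypothetical variable name chosen by the caller.
    """
    value = env.get(env_var, "").strip()
    return value if value else ask(prompt)
```

With this in place, the cell above could use `grok_api_key = resolve_api_key("XAI_API_KEY", "Enter your Grok-2 API key: ")` and fall back to the `getpass` prompt whenever the variable is unset.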

## Define Grok-2 API Interaction

Let's create a function to handle interactions with the Grok-2 API, including comprehensive error handling and debugging information:

In [None]:
def grok_completion(prompt):
    url = "https://api.x.ai/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {grok_api_key}"
    }
    data = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "model": "grok-beta",
        "stream": False,
        "temperature": 0
    }

    try:
        # A timeout keeps the notebook from hanging if the API never responds
        response = requests.post(url, headers=headers, json=data, timeout=60)
        print(f"\nAPI Response Status Code: {response.status_code}")

        if response.status_code != 200:
            print(f"Error Response: {response.text}")
            return None

        response_data = response.json()
        print("\nFull API Response:")
        print(json.dumps(response_data, indent=2))

        if 'choices' not in response_data:
            print("\nWarning: 'choices' key not found in response")
            print("Available keys:", list(response_data.keys()))
            return None

        if not response_data['choices']:
            print("\nWarning: 'choices' array is empty")
            return None

        choice = response_data['choices'][0]
        if 'message' not in choice:
            print("\nWarning: 'message' key not found in first choice")
            print("Available keys in choice:", list(choice.keys()))
            return None

        if 'content' not in choice['message']:
            print("\nWarning: 'content' key not found in message")
            print("Available keys in message:", list(choice['message'].keys()))
            return None

        return choice['message']['content']

    except json.JSONDecodeError as e:
        # Checked first: in recent versions of requests, the error raised by
        # response.json() is also a RequestException and would be caught below
        print(f"\nJSON Decode Error: {str(e)}")
        print("Raw Response:", response.text)
        return None
    except requests.exceptions.RequestException as e:
        print(f"\nRequest Error: {str(e)}")
        return None
    except Exception as e:
        print(f"\nUnexpected Error: {str(e)}")
        return None
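
The chain of key checks in `grok_completion` walks the response shape `choices[0].message.content`. As a rough sketch of that same lookup without the debug output, the hypothetical helper below returns `None` whenever a level is missing. The `sample` dictionary is hand-written test data, not a real API response.

```python
def extract_message_content(response_data):
    """Return choices[0].message.content, or None if any level is missing."""
    choices = response_data.get("choices") if isinstance(response_data, dict) else None
    if not choices:
        return None  # key absent or list empty
    message = choices[0].get("message") or {}
    return message.get("content")


# Hand-written sample mirroring the shape that grok_completion expects
sample = {"choices": [{"message": {"role": "assistant", "content": "startup funding"}}]}
print(extract_message_content(sample))
```

The verbose version in the notebook is more useful while debugging, because it reports which key was missing. The compact form suits code that only needs the text or `None`.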

## Website Crawling Functions

This function combines Grok-2's understanding with Firecrawl's search capabilities to find relevant pages. It:

1. Uses Grok-2 to distill the user's objective into a focused search term
2. Enforces strict formatting rules for consistent search terms
3. Cleans and normalizes the search output
4. Uses Firecrawl's map endpoint to discover relevant pages

The function takes a broad objective (e.g., "Find articles about startup investments") and converts it into an optimized search term (e.g., "startup funding") to ensure targeted results.

Note: The function truncates search terms to at most two words, which keeps the query passed to Firecrawl's map endpoint short and focused.

In [None]:
def find_relevant_pages(objective, url):
    prompt = f"""Based on the objective '{objective}', provide ONLY a 1-2 word search term to locate relevant information on the website.

Rules:
- Return ONLY the search term, nothing else
- Maximum 2 words
- No punctuation or formatting
- No explanatory text"""

    search_term = grok_completion(prompt)

    if search_term is None:
        print("Failed to get search term from Grok-2 API")
        return []

    # Clean up the search term
    search_term = search_term.strip().replace('"', '').replace('*', '')
    words = search_term.split()
    if len(words) > 2:
        search_term = " ".join(words[:2])

    print(f"Using search term: '{search_term}'")

    try:
        map_result = app.map_url(url, params={"search": search_term})
        return map_result.get("links", [])
    except Exception as e:
        print(f"Error mapping URL: {str(e)}")
        return []
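
The cleanup step inside `find_relevant_pages` can be pulled out into a small helper, which makes it easy to check without calling either API. This is a sketch only: the name `normalize_search_term` and the `max_words` parameter are introduced here for illustration. Unlike the inline version, it also collapses repeated whitespace between words.

```python
def normalize_search_term(raw, max_words=2):
    """Strip quotes and asterisks, then keep at most `max_words` words."""
    cleaned = raw.strip().replace('"', "").replace("*", "")
    words = cleaned.split()
    return " ".join(words[:max_words])


print(normalize_search_term('**"startup funding rounds"**'))
```

The example input mimics a model reply that ignored the formatting rules by adding bold markers, quotes, and a third word.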

## Content Extraction and Processing

This function handles the extraction and intelligent processing of content from each webpage. It:

1. Scrapes content from each relevant page
2. Uses Grok-2 to analyze the content against our objective
3. Extracts structured data in JSON format
4. Handles various edge cases and errors

The function processes up to 3 pages and returns the first successful match, using Grok-2 to determine relevance and extract specific data points.

In [None]:
def extract_data_from_pages(links, objective):
    for link in links[:3]:
        try:
            print(f"\nProcessing link: {link}")
            scrape_result = app.scrape_url(link, params={'formats': ['markdown']})
            content = scrape_result.get('markdown', '')

            if not content:
                print("No content extracted from page")
                continue

            prompt = f"""Given the following content, extract the information related to the objective '{objective}' in JSON format. If not found, reply 'Objective not met'.

Content: {content}

Remember:
- Only return JSON if the objective is met.
- Do not include any extra text or markdown formatting.
- Do not wrap the JSON in code blocks.
"""
            result = grok_completion(prompt)

            if result is None:
                print("Failed to get response from Grok-2 API")
                continue

            result = result.strip()

            # Handle case where response is wrapped in code blocks
            if result.startswith("```") and result.endswith("```"):
                # Remove the fence markers and an optional "json" language tag;
                # slicing also works when the whole reply is on a single line
                result = result[3:-3].strip()
                if result.lower().startswith("json"):
                    result = result[4:].strip()

            if result != "Objective not met":
                try:
                    data = json.loads(result)
                    return data
                except json.JSONDecodeError as e:
                    print(f"Error parsing JSON response: {str(e)}")
                    print("Raw response:", result)
                    continue
            else:
                print("Objective not met for this page")

        except Exception as e:
            print(f"Error processing page: {str(e)}")
            continue

    return None
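
The reply-handling logic in `extract_data_from_pages` has three outcomes: valid JSON, the "Objective not met" sentinel, or text that cannot be parsed. A minimal sketch of that logic as a standalone function follows. The name `parse_model_reply` is hypothetical, and treating a trailing period or different capitalization as the sentinel is an added assumption, since models do not always echo the phrase exactly.

```python
import json

NOT_MET = "Objective not met"  # sentinel string the prompt asks the model to return


def parse_model_reply(reply):
    """Return parsed JSON, or None for the sentinel or for unparseable text."""
    text = reply.strip()
    # Strip a surrounding code fence and an optional "json" language tag
    if len(text) >= 6 and text.startswith("```") and text.endswith("```"):
        text = text[3:-3].strip()
        if text.lower().startswith("json"):
            text = text[4:].strip()
    # Assumption: tolerate a trailing period and case differences in the sentinel
    if text.rstrip(".").lower() == NOT_MET.lower():
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Because the function never touches the network, it can be exercised with hand-written strings such as `parse_model_reply('{"founders": ["A", "B"]}')` before wiring it into the crawl loop.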

## Main Execution

Let's create and run the main function that ties everything together:

In [None]:
def main():
    url = input("Enter the website URL to crawl: ")
    objective = input("Enter your data extraction objective: ")

    print("\nFinding relevant pages...")
    links = find_relevant_pages(objective, url)

    if not links:
        print("No relevant pages found.")
        return

    print(f"\nFound {len(links)} relevant pages (showing up to 3):")
    for i, link in enumerate(links[:3], 1):
        print(f"{i}. {link}")

    print("\nExtracting data from pages...")
    data = extract_data_from_pages(links, objective)

    if data:
        print("\nData extracted successfully:")
        # Print the JSON text directly so newlines and indentation render as intended
        print(json.dumps(data, indent=2))
    else:
        print("Could not find data matching the objective.")

In [None]:
# Run the crawler
main()
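
`main()` prints the extracted data but does not keep it. If you want the result on disk, one option is a small helper that writes the data to a JSON file. This is a sketch under assumptions: `save_results` and the default file name `crawl_results.json` are made up for this example. It would be called inside `main()` after a successful extraction, since `main()` does not return the data.

```python
import json
from pathlib import Path


def save_results(data, path="crawl_results.json"):
    """Write extracted data to a UTF-8 JSON file and return the path written."""
    out = Path(path)
    # ensure_ascii=False keeps non-ASCII text readable in the output file
    out.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

For example, adding `save_results(data)` right after the "Data extracted successfully" message would leave a `crawl_results.json` file in the notebook's working directory.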

## What's Next?

Now that you have a working web crawler, consider these enhancements:

1. Add retries with backoff for failed API calls
2. Implement concurrent processing
3. Add content filtering and validation
4. Create custom extraction rules
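
As a starting point for the first two items, here is a minimal sketch using only the standard library. Both helper names, `with_retries` and `map_concurrently`, are hypothetical, and so are the retry count, delays, and worker count. Tune them to the rate limits of the APIs you call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def map_concurrently(fn, items, max_workers=3):
    """Apply fn to each item in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

A call such as `with_retries(lambda: app.scrape_url(link, params={'formats': ['markdown']}))` would retry a failed scrape, and `map_concurrently` could scrape the first few links in parallel. Note that `grok_completion` returns `None` on failure instead of raising, so it would need to raise an exception for `with_retries` to see its errors.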

The combination of Grok-2 and Firecrawl offers powerful possibilities for intelligent web scraping and content analysis.

## Additional Resources

- [x.ai Grok-2 API Documentation](https://api.x.ai/docs)
- [Firecrawl Python Library Documentation](https://docs.firecrawl.dev)
- [Example Code Repository](https://github.com/alexfazio/firecrawl-cookbook)