<a href="https://colab.research.google.com/github/alexfazio/devdocs-to-llm/blob/main/devdocs_to_llm_firecrawl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DevDocs to LLM: turn any documentation into a GPT

By Alex Fazio (https://twitter.com/alxfazio)

GitHub repo: https://github.com/alexfazio/devdocs-to-llm

This Jupyter notebook demonstrates how to use the Firecrawl API to crawl developer documentation, extract its content, and prepare it for use in an assistant such as ChatGPT.

By the end of this notebook, you'll be able to:

1. Set up the Firecrawl environment
2. Crawl a website and generate a sitemap
3. Extract content from crawled pages as Markdown
4. Export the processed content to other platforms, including Rentry.co and Google Docs

This cookbook is designed for developers and data scientists who want to efficiently gather and analyze developer documentation at scale.

## Requirements
Before proceeding, ensure you have the following:

- **Firecrawl API key**: Required to access the Firecrawl service.
- **Google Docs API credentials** (optional): A JSON file named `client_secret_<...>.apps.googleusercontent.com.json` for Google Docs integration.

Note: The Google Docs API credentials are only required if you plan to export content to Google Docs. Everything else in this notebook works without them.

## Tested Documentation Sources

| Status | Documentation Source | URL |
|--------|----------------------|-----|
| ✅ | CrewAI | https://docs.crewai.com/ |
| ✅ | Brave Search API | https://api.search.brave.com/app/documentation/ |
| ✅ | OpenAI | https://platform.openai.com/docs |
| ✅ | Firecrawl | https://docs.firecrawl.dev/ |
| ✅ | Anthropic | https://docs.anthropic.com/en/docs/ |
| ✅ | LangChain | https://python.langchain.com/v0.2/docs |

Note: A checkmark (✅) indicates successful testing with the corresponding documentation source.

## Setup

First, let's set up the environment by installing and importing the libraries we need.

This notebook requires the following libraries:

- `firecrawl-py`: For web crawling and content extraction
- `requests`: For making HTTP requests
- `beautifulsoup4`: For parsing HTML content

In [None]:
%pip install firecrawl-py requests beautifulsoup4 --quiet
print("---")
%pip show requests beautifulsoup4

In [None]:
from firecrawl import FirecrawlApp

Next, enter your Firecrawl API key (`fc-...`).
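
If you run the notebook repeatedly, typing the key each time gets tedious. The following is a minimal sketch of one alternative, not part of the original notebook: it looks for the key in an environment variable first and only prompts when none is set. The variable name `FIRECRAWL_API_KEY` and the helper `resolve_api_key` are hypothetical, and the `fc-` prefix check is an assumption based on the key format hinted at above.

```python
import os
from getpass import getpass

def resolve_api_key(env=None, prompt=getpass):
    """Return an API key from the environment, or prompt for one."""
    env = os.environ if env is None else env
    # FIRECRAWL_API_KEY is a hypothetical variable name chosen for this sketch.
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your Firecrawl API key: ").strip()
    if not key:
        raise ValueError("The Firecrawl API key must not be empty")
    if not key.startswith("fc-"):
        # Assumption: keys follow the `fc-...` pattern mentioned in the text.
        print("Warning: key does not start with 'fc-'")
    return key
```

Calling `fc_api_key = resolve_api_key()` would then stand in for the prompt in the next cell.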

In [None]:
from getpass import getpass
fc_api_key = getpass("Enter your Firecrawl API key: ")
assert fc_api_key != "", "Error: fc_api_key should not be an empty string"

## Crawling

Now let's crawl some DevDocs pages to use in our examples.

Set `sub_url` to the URL of the main documentation overview page, **rather than a specific section or page within the documentation**.

The crawl only follows links that sit under the starting URL, so starting from the overview page lets it reach every topic in the documentation.
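
To see why the starting URL matters, here is a small sketch of the prefix check the crawl cells below rely on: a link is followed only if, once resolved and stripped of its query string and fragment, it still begins with the starting URL. The `docs.example.com` addresses and the helper names are placeholders for illustration.

```python
from urllib.parse import urljoin, urlparse

def normalize(url):
    """Drop the query string and fragment so duplicates collapse to one URL."""
    return urlparse(url)._replace(query="", fragment="").geturl()

def in_scope(start_url, page_url, href):
    """Resolve href against page_url and test whether it stays under start_url."""
    candidate = normalize(urljoin(page_url, href))
    return candidate.startswith(start_url)

root = "https://docs.example.com/"
deep = "https://docs.example.com/guides/"

print(in_scope(root, root, "/api/reference"))         # True
print(in_scope(deep, deep, "/api/reference"))         # False
print(in_scope(deep, deep, "install?tab=pip#linux"))  # True
```

Starting from the `/guides/` section, the link to `/api/reference` falls outside the prefix and would be skipped, which is why a deep starting URL yields a partial crawl.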

In [None]:
sub_url = "https://docs.cursor.com/" # @param {type:"string"}
assert sub_url != "", "Error: sub_url should not be an empty string"

### Attempt Site Map Display and Page Count Preview (Optional)

The following cell will display the site map and preview the page count, providing only a general idea of the structure and number of pages. This process **will not** use the Firecrawl API or consume any usage tokens. Please be aware that this preview is not as accurate as crawling the entire website with Firecrawl.
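
The idea behind the preview can be sketched without any network access. The version below is an iterative variant written for illustration, not the notebook's own function: it uses the standard-library `HTMLParser` instead of BeautifulSoup and takes a `fetch` callable so that a hypothetical in-memory site (`PAGES`) can stand in for real HTTP responses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def preview_sitemap(start_url, fetch, max_depth=3):
    """Depth-limited walk over links that stay under start_url."""
    visited, order = set(), []
    stack = [(start_url, 0)]
    while stack:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        # Push in reverse so links are visited in document order.
        for href in reversed(parser.hrefs):
            full = urljoin(url, href)
            if full.startswith(start_url):
                stack.append((full, depth + 1))
    return order

# Hypothetical in-memory site standing in for real HTTP responses.
PAGES = {
    "https://docs.example.com/": '<a href="intro">Intro</a><a href="https://other.example.org/">Ext</a>',
    "https://docs.example.com/intro": '<a href="/">Home</a><a href="api">API</a>',
    "https://docs.example.com/api": "",
}

pages = preview_sitemap("https://docs.example.com/", lambda u: PAGES.get(u, ""))
print(len(pages), pages)
```

The external link is ignored and the link back to the home page is skipped as already visited, so the preview reports three pages.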

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_site_map(url, base_url=None, depth=0, max_depth=3, visited=None, sitemap=None):
    if visited is None:
        visited = set()
    if sitemap is None:
        sitemap = []
    if base_url is None:
        base_url = url
    if depth > max_depth or url in visited:
        return

    visited.add(url)
    sitemap.append(url)

    try:
        # A timeout keeps one unresponsive page from stalling the whole preview
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        print('| ' * depth + '+-- ' + url)

        for link in soup.find_all('a', href=True):
            href = link['href']
            full_url = urljoin(url, href)
            # Check if the full_url starts with the base_url
            if full_url.startswith(base_url):
                get_site_map(full_url, base_url, depth + 1, max_depth, visited, sitemap)
    except Exception as e:
        print(f"Error processing {url}: {str(e)}")

    return sitemap

def crawl_sub_url(sub_url, max_depth=3):
    # Ensure the sub_url ends with a '/'
    if not sub_url.endswith('/'):
        sub_url += '/'

    base_url = sub_url
    print(f"[REQUESTS CRAWL] Sitemap for {base_url}:")
    sitemap = get_site_map(base_url, base_url=base_url, max_depth=max_depth)
    print(f"\n[REQUESTS CRAWL] Total pages crawled: {len(sitemap)}")
    return sitemap

# Example usage
preview_sitemap = crawl_sub_url(sub_url)
print("\nSite map:")
print(preview_sitemap)

# Store the preview_sitemap in a global variable
sitemap = preview_sitemap

Next, let's initialize `FirecrawlApp`, a Python object that lets you interact with the Firecrawl API. It sets up a connection so you can call its methods (such as `crawl_url` or `scrape_url`) to send requests to Firecrawl and get website data back.

In [None]:
app = FirecrawlApp(api_key=fc_api_key)

To manage costs and control the crawl scope, set the maximum number of pages to crawl with the `limit` parameter below.

In [None]:
#@markdown Limit the crawl to a maximum of `limit` pages
limit = 55 #@param {type:"number"}

In [None]:
# @title Crawl Launch

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from firecrawl import FirecrawlApp

def merged_crawl(start_url, limit, fc_api_key=fc_api_key):
    def standard_crawl(start_url):
        visited = set()
        collected_urls = []

        def normalize_url(url):
            parsed_url = urlparse(url)
            normalized_url = parsed_url._replace(fragment="", query="").geturl()
            return normalized_url

        def is_valid_subpage(url):
            return url.startswith(start_url) and not url.startswith(start_url + '#')

        def crawl(url):
            if len(collected_urls) >= limit:
                return

            normalized_url = normalize_url(url)
            if (normalized_url in visited) or (not is_valid_subpage(normalized_url)):
                return
            visited.add(normalized_url)
            try:
                response = requests.get(normalized_url, timeout=10)
                if response.status_code != 200:
                    return
            except requests.exceptions.RequestException:
                return

            collected_urls.append(normalized_url)
            print(f"[REQUESTS CRAWLER] {len(collected_urls)}/{limit}: {normalized_url}")

            if len(collected_urls) >= limit:
                return

            soup = BeautifulSoup(response.content, "html.parser")
            for link in soup.find_all("a", href=True):
                if len(collected_urls) >= limit:
                    return
                href = link['href']
                full_url = urljoin(url, href)
                normalized_full_url = normalize_url(full_url)
                if is_valid_subpage(normalized_full_url):
                    crawl(normalized_full_url)

        crawl(start_url)
        return collected_urls

    def firecrawl_method(start_url, limit):
        app = FirecrawlApp(api_key=fc_api_key)
        crawl_result = app.crawl_url(
            start_url,
            {
                'crawlerOptions': {
                    'includePaths': ['/docs/', '/documentation/'],
                    'limit': limit,
                }
            }
        )

        urls_to_scrape = []
        for page_data in crawl_result:
            metadata = page_data.get('metadata', {})
            source_url = metadata.get('sourceURL')
            if source_url:
                urls_to_scrape.append(source_url)
            links_on_page = metadata.get('linksOnPage')
            if links_on_page:
                urls_to_scrape.extend(links_on_page)

        return urls_to_scrape

    try:
        print("[CRAWLER] Attempting standard crawl...")
        result = standard_crawl(start_url)
        if not result:
            raise Exception("[CRAWLER] Standard crawl returned no results")
        return result
    except Exception as e:
        print(f"[CRAWLER] Standard crawl failed: {str(e)}")
        print("[CRAWLER] Falling back to firecrawl method...")
        return firecrawl_method(start_url, limit)

# Usage
sitemap = merged_crawl(sub_url, limit)
print("---")
print(f"[CRAWLER] Crawled URLs (sitemap): {sitemap}")
print(f"[CRAWLER] Number of pages crawled: {len(sitemap)}")

## Scraping

With our sitemap in hand, we can now proceed to extract content from each page. Firecrawl's content extraction capabilities allow us to efficiently parse web pages and retrieve the main content as markdown, filtering out navigation menus, advertisements, and other non-essential elements.

Before we begin the extraction process, let's set some parameters:

- `scrape_option`: Choose whether to scrape all pages or a specific number of pages.
- `num_pages`: If scraping a specific number of pages, set the desired number here.

Please set these parameters in the cell below.

In [None]:
# @title Scraping Options

# Create a dropdown for scrape options
scrape_option = "Specific number of pages"  # @param ["All pages", "Specific number of pages"]

# Create a numerical input for the specific number of pages
num_pages = 55  # @param {type:"number"}

# Initialize the num_pages variable depending on the scrape_option
if scrape_option == "Specific number of pages":
    # Check if the number of pages exceeds the length of the sitemap
    num_pages = min(num_pages, len(sitemap))  # Set num_pages to the smaller of the two values
else:
    # If "All pages" is selected, set num_pages to the total length of the sitemap
    num_pages = len(sitemap)

# Now you can proceed with your scraping logic using num_pages
print(f"[SCRAPER] Number of pages to scrape: {num_pages}")

Now, let's execute the content extraction process. Our script will:

1. Iterate through the URLs in our sitemap
2. Use Firecrawl's API to extract the main content from each page
3. Store the extracted content in both XML and Markdown formats

XML helps in structuring large documents before feeding them to an LLM for RAG or direct query.
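
As a rough illustration of that structure, the sketch below wraps a few pages in the same `<document>` / `<page>` / `<content>` layout and checks that the result parses. It differs from the scrape cell further down in two ways that are assumptions of this sketch: it escapes `&`, `<`, and `>` in the page text, and it adds a hypothetical `<source>` element holding the URL. Escaping is optional if the file is only ever read by an LLM, but it matters if you also want to load the file with an XML parser.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def pages_to_xml(pages):
    """Wrap (url, markdown) pairs in a <document> of <page> elements."""
    parts = ["<document>"]
    for url, markdown in pages:
        parts.append("<page>")
        # <source> is a hypothetical extra element; the notebook stores only <content>.
        parts.append(f"<source>{escape(url)}</source>")
        parts.append(f"<content>\n{escape(markdown)}\n</content>")
        parts.append("</page>")
    parts.append("</document>")
    return "\n".join(parts)

xml_text = pages_to_xml([
    ("https://docs.example.com/intro", "# Intro\nUse `a < b` & friends."),
    ("https://docs.example.com/api", "# API\n<Tabs> are rendered as text."),
])

# Parsing succeeds because the Markdown was escaped before being embedded.
root = ET.fromstring(xml_text)
print(len(root.findall("page")))
```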

The table below lists the maximum scrape rate for each Firecrawl plan:

| **Plan**     | **Max Pages Scraped/Minute** |
|--------------|------------------------------|
| Free         | 5                             |
| Hobby        | 10                            |
| Standard     | 50                            |
| Growth       | 500                           |

Set the parameters below so that you stay within the rate limits of your Firecrawl plan and avoid triggering scraping restrictions.

Check the [Firecrawl docs](https://docs.firecrawl.dev/rate-limits) for further guidance on rate limits.
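
To get a feel for how these settings interact, the sketch below works through the chunking arithmetic with the notebook's default values (55 pages, 9 pages per chunk, 33 seconds between chunks). The helper `plan_chunks` is hypothetical and only counts the pauses; the time each scrape takes comes on top, so whether a given pause keeps you under your plan's limit depends on how quickly each chunk finishes.

```python
import math

def plan_chunks(total_pages, pages_per_minute, wait_seconds):
    """Return (start, end) index ranges per chunk and the total pause time."""
    if pages_per_minute < 1:
        raise ValueError("pages_per_minute must be at least 1")
    total_chunks = math.ceil(total_pages / pages_per_minute)
    chunks = [
        (i * pages_per_minute, min((i + 1) * pages_per_minute, total_pages))
        for i in range(total_chunks)
    ]
    # The pause happens between chunks, so the last chunk adds no wait.
    total_wait = max(total_chunks - 1, 0) * wait_seconds
    return chunks, total_wait

chunks, total_wait = plan_chunks(total_pages=55, pages_per_minute=9, wait_seconds=33)
print(len(chunks), chunks[-1], total_wait)  # 7 (54, 55) 198
```

With these defaults the run is split into 7 chunks, the last of which holds a single page, and the pauses alone add 198 seconds.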

In [None]:
# @title Rate Limiting and Retry Parameters
# @markdown Set the rate limiting and retry parameters for the web scraper:

pages_per_minute = 9 # @param {type:"integer"}
# @markdown Number of pages that can be scraped per minute

wait_time_between_chunks = 33 # @param {type:"integer"}
# @markdown Waiting time (in seconds) between chunks

retry_attempts = 3 # @param {type:"integer"}
# @markdown Number of times to retry failed scrapes

print(f"Pages per minute: {pages_per_minute}")
print(f"Wait time between chunks: {wait_time_between_chunks} seconds")
print(f"Number of retry attempts: {retry_attempts}")

# You can use these variables in your main scraping code

## Check the status of the `sitemap` variable for debugging



In [None]:
print(sitemap)

In [None]:
import time
import math
import requests

# @title Scrape Launch

# Initialize a file to store the XML content
output_file = 'scraped_content.xml'

# Initialize strings to store all the XML and markdown content
all_xml = "<document>\n"
all_markdown = ""

# Initialize a list to store failed scrape URLs
failed_scrapes = []

# Determine the number of pages to scrape
pages_to_scrape = len(sitemap) if scrape_option == "All pages" else min(num_pages, len(sitemap))

# Calculate the chunk size and total number of chunks
chunk_size = pages_per_minute
total_chunks = math.ceil(pages_to_scrape / chunk_size)

def scrape_url(url, attempt=1):
    print(f"[FIRECRAWL SCRAPER] Attempting to scrape URL: {url} (Attempt {attempt})")
    start_time = time.time()
    try:
        response = app.scrape_url(url=url, params={
            'onlyMainContent': True,
            'waitFor': 5000,
        })
        markdown_content = response.get('markdown', '')
        end_time = time.time()
        scrape_time = end_time - start_time
        print(f"[FIRECRAWL SCRAPER] Successfully scraped URL: {url}")
        print(f"[FIRECRAWL SCRAPER] Scrape time: {scrape_time:.2f} seconds")
        print(f"[FIRECRAWL SCRAPER] Content length: {len(markdown_content)} characters")
        return markdown_content
    except Exception as e:  # any failure (HTTP error, timeout, connection issue) marks the URL for retry
        end_time = time.time()
        scrape_time = end_time - start_time
        print(f"[FIRECRAWL SCRAPER] Error scraping {url}: {str(e)}")
        print(f"[FIRECRAWL SCRAPER] Scrape time (failed): {scrape_time:.2f} seconds")
        return None

def process_scraped_content(url, markdown_content):
    global all_xml, all_markdown
    # Create XML structure for this page without indentation
    page_xml = f"<page>\n<content>\n{markdown_content}\n</content>\n</page>\n"

    # Add the markdown content with a proper separator
    if all_markdown:  # If it's not the first entry, add a separator
        all_markdown += "\n\n***\n\n"
    all_markdown += f"# {url}\n\n{markdown_content}"

    # Append the page XML to the XML string
    all_xml += page_xml
    print(f"[FIRECRAWL SCRAPER] Processed content for URL: {url}")
    print(f"[FIRECRAWL SCRAPER] XML content length: {len(page_xml)} characters")
    return page_xml

# Open the file in write mode
with open(output_file, 'w', encoding='utf-8') as file:
    # Write the opening XML tag
    file.write("<document>\n")

    for chunk in range(total_chunks):
        chunk_start = chunk * chunk_size
        chunk_end = min((chunk + 1) * chunk_size, pages_to_scrape)
        print(f"[FIRECRAWL SCRAPER] Processing chunk {chunk+1}/{total_chunks} (URLs {chunk_start+1}-{chunk_end})")

        for i in range(chunk_start, chunk_end):
            url = sitemap[i]
            print(f"[FIRECRAWL SCRAPER] Processing URL {i+1}/{pages_to_scrape}: {url}")
            markdown_content = scrape_url(url)

            if markdown_content is not None:
                page_xml = process_scraped_content(url, markdown_content)
                # Write the page XML to the file
                file.write(page_xml)
                print(f"[FIRECRAWL SCRAPER] Successfully wrote content for URL: {url}")
            else:
                failed_scrapes.append(url)
                print(f"[FIRECRAWL SCRAPER] Failed to scrape URL: {url}")

        # Wait after each chunk, except for the last one
        if chunk < total_chunks - 1:
            print(f"[FIRECRAWL SCRAPER] Chunk {chunk+1} completed. Waiting for {wait_time_between_chunks} seconds before the next chunk...")
            time.sleep(wait_time_between_chunks)

    # Retry failed scrapes
    for attempt in range(retry_attempts):
        if not failed_scrapes:
            break
        print(f"[FIRECRAWL SCRAPER] Retry attempt {attempt + 1} of {retry_attempts} for {len(failed_scrapes)} failed scrapes...")
        retry_urls = failed_scrapes.copy()
        failed_scrapes = []
        for url in retry_urls:
            print(f"[FIRECRAWL SCRAPER] Retrying URL: {url}")
            markdown_content = scrape_url(url, attempt=attempt+2)
            if markdown_content is not None:
                page_xml = process_scraped_content(url, markdown_content)
                # Write the page XML to the file
                file.write(page_xml)
                print(f"[FIRECRAWL SCRAPER] Successfully scraped and wrote content for retried URL: {url}")
            else:
                failed_scrapes.append(url)
                print(f"[FIRECRAWL SCRAPER] Failed to scrape URL on retry: {url}")

        if failed_scrapes:
            print(f"[FIRECRAWL SCRAPER] Retry attempt {attempt + 1} completed. Waiting for {wait_time_between_chunks} seconds before the next retry attempt...")
            time.sleep(wait_time_between_chunks)

    # Write the closing XML tag
    file.write("</document>")
    print("[FIRECRAWL SCRAPER] Finished writing to XML file")

# Add the closing tag to the XML string variable
all_xml += "</document>"

# Now you can use the 'all_xml' and 'all_markdown' string variables as needed
print(f"[FIRECRAWL SCRAPER] Total characters in all_xml: {len(all_xml)}")
print(f"[FIRECRAWL SCRAPER] Total characters in all_markdown: {len(all_markdown)}")
print(f"[FIRECRAWL SCRAPER] Number of pages scraped: {pages_to_scrape - len(failed_scrapes)} of {pages_to_scrape}")
print(f"[FIRECRAWL SCRAPER] Number of pages that failed to scrape after all retries: {len(failed_scrapes)}")
if failed_scrapes:
    print("[FIRECRAWL SCRAPER] Failed URLs:")
    for url in failed_scrapes:
        print(url)

## Exporting Extracted Content

After extracting the content, we have several options for exporting and storing the data. In this notebook, we'll demonstrate two export methods:

1. Exporting to Rentry.co, a simple pastebin-like service
2. Exporting to Google Docs
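
The Rentry export cell below rejects content longer than 200,000 characters and suggests Google Docs instead. If you would rather stay with a paste service, one option is to split the output into several parts. The following is a minimal sketch of that idea, not something the notebook does: the helper `split_for_paste` is hypothetical, and it assumes the Markdown output in which pages are joined by the `***` separator used in the scrape cell.

```python
SEPARATOR = "\n\n***\n\n"  # page separator used when building all_markdown

def split_for_paste(markdown, limit=200_000, separator=SEPARATOR):
    """Greedily pack whole pages into parts of at most `limit` characters."""
    parts, current = [], ""
    for page in markdown.split(separator):
        candidate = page if not current else current + separator + page
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                parts.append(current)
            # A single page longer than the limit is kept whole, not cut mid-page.
            current = page
    if current:
        parts.append(current)
    return parts

# Three short dummy pages and a tiny limit, to make the packing visible.
sample = SEPARATOR.join(["# a\n" + "x" * 40, "# b\n" + "y" * 40, "# c\n" + "z" * 40])
parts = split_for_paste(sample, limit=100)
print([len(p) for p in parts])  # [95, 44]
```

Each part could then be exported as its own paste. A page that exceeds the limit by itself still comes back as an oversized part, so the caller needs to check part lengths before uploading.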

In [None]:
# @title Export to Rentry.co

# Import necessary libraries
import os
import requests
import re

# Function to strip HTML tags
def strip_html_tags(text):
    return re.sub('<[^<]+?>', '', text)

# Function to create a new Rentry post
def new_rentry(url, edit_code, text):
    base_url = os.getenv('BASE_URL', 'https://rentry.co')
    api_url = f"{base_url}/api/new"

    # Get CSRF token
    session = requests.Session()
    response = session.get(base_url)
    csrf_token = session.cookies.get('csrftoken')

    # Prepare payload
    payload = {
        'csrfmiddlewaretoken': csrf_token,
        'url': url,
        'edit_code': edit_code,
        'text': text
    }

    headers = {
        "Referer": base_url,
        "X-CSRFToken": csrf_token
    }

    # Make POST request
    response = session.post(api_url, data=payload, headers=headers)
    return response.json()

# Function to export content to Rentry
def export_to_rentry(content):
    cleaned_content = strip_html_tags(content)

    # Check if the content exceeds 200,000 characters
    if len(cleaned_content) > 200000:
        print("The content exceeds 200,000 characters. Please try using Google Docs instead due to the character limit.")
        return None, None

    url = ''  # Leave empty for random URL
    edit_code = ''  # Leave empty for random edit code
    response = new_rentry(url, edit_code, cleaned_content)
    if response['status'] == '200':
        return response['url'], response['edit_code']
    else:
        return None, None

# Main execution
rentry_url, rentry_edit_code = export_to_rentry(all_xml)

if rentry_url and rentry_edit_code:
    print("Rentry document created successfully!")
    print(f"URL: {rentry_url}")
    print(f"Edit code: {rentry_edit_code}")
elif len(strip_html_tags(all_xml)) > 200000:
    # This message is already printed in the export_to_rentry function, but we'll keep it here for clarity
    print("The content exceeds 200,000 characters. Please try using Google Docs instead due to the character limit.")
else:
    print("Failed to create Rentry document.")

In [None]:
# @title Export to Google Docs

from google.colab import files
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import Flow
from googleapiclient.discovery import build
import json
import io
import getpass

# Function to securely get input
def secure_input(prompt):
    return getpass.getpass(prompt)

# Upload button for JSON credentials file
print("Please upload your client secret JSON file.")
uploaded = files.upload()

# Get the filename of the uploaded file
filename = list(uploaded.keys())[0]

# Read the contents of the uploaded file
client_secret_json = io.StringIO(uploaded[filename].decode('utf-8')).read()

# Parse the JSON content
client_secret_data = json.loads(client_secret_json)

# Create a Flow instance
flow = Flow.from_client_config(
    client_secret_data,
    scopes=['https://www.googleapis.com/auth/documents'],
    redirect_uri='urn:ietf:wg:oauth:2.0:oob')

# Tell the user to go to the authorization URL.
auth_url, _ = flow.authorization_url(prompt='consent')

print("Please go to this URL to authorize the application:")
print(auth_url)

# The user will get an authorization code. This line will wait for the user to input it securely.
code = secure_input("Enter the authorization code: ")

# Exchange the authorization code for credentials.
flow.fetch_token(code=code)

# Get the credentials
creds = flow.credentials

# Create a Docs API service object
service = build('docs', 'v1', credentials=creds)

# Create a new document
document = service.documents().create(body={'title': 'My New Document'}).execute()
print(f"Created document with title: {document.get('title')}")

# Get the document ID
document_id = document.get('documentId')

# Prepare the content to be inserted
# (named doc_requests so it does not shadow the imported `requests` module)
doc_requests = [
    {
        'insertText': {
            'location': {
                'index': 1,
            },
            'text': all_markdown
        }
    }
]

# Execute the request to insert the content
result = service.documents().batchUpdate(documentId=document_id, body={'requests': doc_requests}).execute()

print(f"Document content updated. You can find it at: https://docs.google.com/document/d/{document_id}/")

# Clear sensitive variables
del client_secret_json, client_secret_data, code, creds