<a href="https://colab.research.google.com/github/alexfazio/firecrawl-quickstart/blob/main/llm_extract_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Firecrawl LLM Extract Tutorial

By Alex Fazio (https://twitter.com/alxfazio)

GitHub repo: https://github.com/alexfazio/firecrawl-cookbook

This Jupyter notebook demonstrates how to use Firecrawl's LLM Extract feature to extract structured data from web pages. By the end of this tutorial, you'll be able to:

1. Set up the Firecrawl environment
2. Extract data using a schema
3. Extract data using prompts without a schema

This notebook is aimed at developers who want to extract structured data from web pages using LLMs.

## Requirements

Before proceeding, ensure you have:

- **Firecrawl API key**: Required for accessing the Firecrawl service
- **Python environment**: With the packages listed below installed

We'll be using the following packages:
- `firecrawl` (installed from PyPI as `firecrawl-py`): For interacting with the Firecrawl API
- `pydantic`: For schema definition

## Setup

First, let's install the required packages:

In [1]:
%pip install firecrawl-py pydantic --quiet


Next, let's set up our Firecrawl API key:

In [2]:
from getpass import getpass
api_key = getpass("Enter your Firecrawl API key: ")

Enter your Firecrawl API key: ··········
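
If you run the notebook repeatedly, typing the key each time gets tedious. The sketch below is one possible variation on the cell above: it looks for the key in an environment variable first and only falls back to the `getpass` prompt when none is set. The helper name `load_api_key` and the variable name `FIRECRAWL_API_KEY` are assumptions chosen for illustration, not something this notebook requires.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "FIRECRAWL_API_KEY", prompt_fn=getpass) -> str:
    """Return the API key from the environment, or prompt for it if unset.

    `env_var` is an assumed variable name; change it to match your setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Nothing in the environment: fall back to a hidden interactive prompt
        key = prompt_fn("Enter your Firecrawl API key: ").strip()
    if not key:
        raise ValueError("No Firecrawl API key provided")
    return key
```

In the notebook you would then write `api_key = load_api_key()` in place of the direct `getpass` call.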


## Extracting Data with a Schema

Let's start by importing the required libraries and defining our schema for extraction:

In [3]:
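from firecrawl import FirecrawlApp
from pydantic import BaseModel

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key=api_key)

# Each field becomes a key in the extracted result; the type hints constrain the values
class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

# Request the 'extract' format and pass the model's JSON Schema to guide the LLM
data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})

# The structured result is returned under the 'extract' key
print(data['extract'])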

{'company_mission': "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to", 'supports_sso': True, 'is_open_source': False, 'is_in_yc': True}
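
The result above is a plain dictionary, so nothing guarantees that it matches the schema you asked for. As a minimal sketch, the same Pydantic model can be reused to validate the result after the fact. The model name `DescribedSchema` and the field descriptions are hypothetical additions; descriptions are included in the generated JSON Schema, but whether they improve extraction quality is an assumption you should verify on your own pages. The sample dictionary is copied from the output above, so this block runs without calling the API.

```python
from pydantic import BaseModel, Field, ValidationError


class DescribedSchema(BaseModel):
    # Descriptions are hypothetical; they are carried into the JSON Schema
    company_mission: str = Field(description="One-sentence mission statement")
    supports_sso: bool = Field(description="Whether single sign-on is supported")
    is_open_source: bool = Field(description="Whether the project is open source")
    is_in_yc: bool = Field(description="Whether the company took part in Y Combinator")


# The schema dictionary that would be passed as 'schema' in the request
json_schema = DescribedSchema.model_json_schema()
print(json_schema["properties"]["supports_sso"]["description"])

# Sample result copied from the output above (no API call is made here)
sample_extract = {
    "company_mission": "Train a secure AI on your technical resources that "
                       "answers customer and employee questions so your team "
                       "doesn't have to",
    "supports_sso": True,
    "is_open_source": False,
    "is_in_yc": True,
}

# Raises ValidationError if a field is missing or has the wrong type
validated = DescribedSchema.model_validate(sample_extract)
print(validated.supports_sso)

# An incomplete result is rejected, and the error lists the missing fields
try:
    DescribedSchema.model_validate({"company_mission": "Example"})
    missing = []
except ValidationError as exc:
    missing = sorted(err["loc"][0] for err in exc.errors())
print(missing)
```

After validation you can use attribute access (`validated.company_mission`) instead of dictionary lookups, and typos in field names surface immediately.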


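## Extracting Data without a Schema

Firecrawl also supports extraction using just a prompt, allowing the LLM to determine the structure. The two cells below send the same request to the `/v1/scrape` REST endpoint, first with curl and then with the `requests` library: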

In [7]:
# Method 1: Using curl with a properly formatted command string
curl_command = f'''
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer {api_key}' \
  -d '{{
    "url": "https://docs.firecrawl.dev/",
    "formats": ["extract"],
    "extract": {{
      "prompt": "Extract the company mission from the page."
    }}
  }}'
'''

!{curl_command}

{"success":true,"data":{"extract":{"company_mission":"Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"},"metadata":{"title":"Quickstart | Firecrawl","description":"Firecrawl allows you to turn entire websites into LLM-ready markdown","language":"en","ogLocaleAlternate":[],"viewport":"width=device-width","msapplication-config":"https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/_generated/favicon/browserconfig.xml?v=3","apple-mobile-web-app-title":"Firecrawl Docs","application-name":"Firecrawl Docs","msapplication-TileColor":"#000","theme-color":"#ffffff","charset":"utf-8","og:type":"website","og:site_name":"Firecrawl Docs","twitter:card":"summary_large_image","og:title":"Quickstart | Firecrawl","twitter:title":"Firecrawl Docs","og:image":"/images/og.png","twitter:image":"/images/og.png","og:description":"Firecrawl allows you to turn entire websites into LLM-ready markdown","og:url":"https://docs.firecrawl.dev/in
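
Writing the JSON body by hand inside an f-string means doubling every brace and watching out for quotes in the prompt. One possible alternative, sketched below, builds the body with `json.dumps` and quotes each shell argument with `shlex.join`. The helper name `build_curl_command` is hypothetical, and the added `-s` flag only silences curl's progress meter; the endpoint, headers, and payload keys are the ones used in the cell above.

```python
import json
import shlex


def build_curl_command(api_key: str, page_url: str, prompt: str) -> str:
    """Build a shell-safe curl command for a prompt-only extract request."""
    payload = {
        "url": page_url,
        "formats": ["extract"],
        "extract": {"prompt": prompt},
    }
    parts = [
        "curl", "-s", "-X", "POST", "https://api.firecrawl.dev/v1/scrape",
        "-H", "Content-Type: application/json",
        "-H", f"Authorization: Bearer {api_key}",
        # json.dumps handles escaping inside the JSON body
        "-d", json.dumps(payload),
    ]
    # shlex.join quotes each argument so the shell sees it as one token
    return shlex.join(parts)
```

In a notebook cell you could assign `curl_command = build_curl_command(api_key, "https://docs.firecrawl.dev/", "Extract the company mission from the page.")` and run it with `!{curl_command}` as before.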

In [8]:
# Method 2: Alternative approach using requests library
import requests
import json

url = "https://api.firecrawl.dev/v1/scrape"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}
payload = {
    "url": "https://docs.firecrawl.dev/",
    "formats": ["extract"],
    "extract": {
        "prompt": "Extract the company mission from the page."
    }
}

response = requests.post(url, headers=headers, json=payload)
print(json.dumps(response.json(), indent=2))

{
  "success": true,
  "data": {
    "extract": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"
    },
    "metadata": {
      "title": "Quickstart | Firecrawl",
      "description": "Firecrawl allows you to turn entire websites into LLM-ready markdown",
      "language": "en",
      "ogLocaleAlternate": [],
      "viewport": "width=device-width",
      "msapplication-config": "https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/_generated/favicon/browserconfig.xml?v=3",
      "apple-mobile-web-app-title": "Firecrawl Docs",
      "application-name": "Firecrawl Docs",
      "msapplication-TileColor": "#000",
      "theme-color": "#ffffff",
      "charset": "utf-8",
      "og:type": "website",
      "og:site_name": "Firecrawl Docs",
      "twitter:card": "summary_large_image",
      "og:title": "Quickstart | Firecrawl",
      "twitter:title": "Firecrawl Docs",
      "og:image": "/images
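
Most of the response above is page metadata; the extracted fields sit under `data.extract`. The sketch below shows one way to pull them out of the parsed JSON and fail loudly when the request did not succeed. The helper name `get_extract` is hypothetical, and reading an `error` key from a failed response is an assumption about the error format rather than something shown in this notebook. The sample body is abbreviated from the output above.

```python
def get_extract(response_body: dict) -> dict:
    """Return the 'extract' payload from a parsed /v1/scrape response."""
    if not response_body.get("success"):
        # Assumption: failed responses carry a human-readable 'error' field
        reason = response_body.get("error", "unknown error")
        raise RuntimeError(f"Firecrawl request failed: {reason}")
    extract = response_body.get("data", {}).get("extract")
    if extract is None:
        raise KeyError("Response contains no 'extract' field")
    return extract


# Abbreviated copy of the response shown above (metadata trimmed)
sample_body = {
    "success": True,
    "data": {
        "extract": {
            "company_mission": "Train a secure AI on your technical resources "
                               "that answers customer and employee questions "
                               "so your team doesn't have to"
        },
        "metadata": {"title": "Quickstart | Firecrawl", "language": "en"},
    },
}

extracted = get_extract(sample_body)
print(list(extracted))
```

With a live request you would call `get_extract(response.json())` after `requests.post(...)`.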

## Next Steps

You've now learned how to:
1. Set up Firecrawl for data extraction
2. Extract data using a defined schema
3. Extract data using prompts without a schema

For more information about the extract format and additional features, visit the [Firecrawl documentation](https://docs.firecrawl.dev/features/extract).