Install Firecrawl Library
This cell installs the firecrawl library using pip. The -U flag ensures that the library is upgraded to the latest version if it's already installed. Firecrawl is a web scraping and data extraction tool.

In [1]:
!pip install -U firecrawl


Collecting firecrawl
  Downloading firecrawl-2.16.5-py3-none-any.whl.metadata (7.2 kB)
Collecting python-dotenv (from firecrawl)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading firecrawl-2.16.5-py3-none-any.whl (40 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.1/40.1 kB 528.8 kB/s eta 0:00:00
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, firecrawl
Successfully installed firecrawl-2.16.5 python-dotenv-1.1.1


Import Necessary Libraries
The imports sit at the top of the next code cell, together with the class definition.

os: Provides access to the operating system; used here to read environment variables.
firecrawl: The Firecrawl library for web scraping and data extraction.
dotenv: Loads environment variables from a .env file. load_dotenv() is called in WebsiteScraper.__init__ so that FIRECRAWL_API_KEY can be read from such a file.
pandas: A data manipulation and analysis library, used here to work with data in DataFrame format.
typing: Provides support for type hints, used for better code readability and maintainability.
pydantic: A data validation library, used to define data models and generate JSON schemas for structured data extraction.
time: Provides time-related functions. It is imported but not used in the current code.
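
os.getenv returns None when a variable is unset, so a missing key is easy to overlook. A minimal sketch of a check that fails early with an explicit message, using only the standard library. require_env is a hypothetical helper for illustration, not part of Firecrawl or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if unset or empty.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; define it in your shell "
            "or in a .env file loaded with load_dotenv()."
        )
    return value
```

It could be called as require_env("FIRECRAWL_API_KEY") right after load_dotenv(), in place of the bare os.getenv call.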

Define WebsiteScraper Class
This cell defines a Python class WebsiteScraper to encapsulate the web scraping logic using the Firecrawl library.

The __init__ method initializes the FirecrawlApp with an API key obtained from the environment variables. It also initializes a default schema fields list.
The create_dynamic_model method dynamically creates a Pydantic BaseModel class based on a list of schema fields provided. This allows for flexible data extraction based on user-defined fields.
The create_schema_from_fields method takes the schema fields and generates a JSON schema using the dynamically created Pydantic model. This schema is used by Firecrawl for structured data extraction.
The convert_to_table method (although not used in the subsequent cells) is intended to convert extracted data into a pandas DataFrame and return it as a string.
The scrape_website method is the main function for scraping. It takes a website URL, a prompt for the extraction, and optional schema fields. It calls the Firecrawl extract method with the provided parameters and returns the extracted data.

In [13]:
import os
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
import pandas as pd
from typing import Dict, Any
from pydantic import BaseModel
import time

class WebsiteScraper:
    def __init__(self):
        load_dotenv()
        self.firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
        self.app = FirecrawlApp(api_key=self.firecrawl_api_key)
        self.schema_fields = [{"name": "", "type": "str"}]

    def create_dynamic_model(self, fields):
        """Create a dynamic Pydantic model from schema fields."""
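        # Map schema type names to Python types (built once, outside the loop).
        type_mapping = {
            "str": str,
            "bool": bool,
            "int": int,
            "float": float
        }
        field_annotations = {}
        for field in fields:
            # Skip placeholder entries with an empty name.
            if field["name"]:
                field_annotations[field["name"]] = type_mapping[field["type"]]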

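        # The key must be the dunder `__annotations__`. A single-underscore
        # `_annotations_` key is not read as field annotations, so the model
        # would have no fields and the schema's `properties` would be empty.
        return type(
            "ExtractSchema",
            (BaseModel,),
            {
                "__annotations__": field_annotations
            }
        )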

    def create_schema_from_fields(self, fields):
        """Create schema using Pydantic model."""
        if not any(field["name"] for field in fields):
            return None

        model_class = self.create_dynamic_model(fields)
        return model_class.model_json_schema()

    def convert_to_table(self, data: Dict[str, Any]) -> str:
        """Convert data to a pandas DataFrame and return as string."""
        if not data or 'data' not in data:
            return ""

        df = pd.DataFrame([data['data']])
        return df.to_string(index=False)

    def scrape_website(self, website_url: str, prompt: str, schema_fields=None):
        """Main function to scrape website data."""
        if not website_url:
            raise ValueError("Please provide a website URL")

        try:
            schema = self.create_schema_from_fields(schema_fields) if schema_fields else None

            extract_params = {'prompt': prompt}
            if schema:
                extract_params['schema'] = schema

            # The URL is passed in a list; prompt and schema go in as keyword arguments.
            data = self.app.extract([website_url], **extract_params)

            return data

        except Exception as e:
            # Chain the original exception explicitly so its traceback is preserved.
            raise Exception(f"An error occurred: {str(e)}") from e
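
Pydantic also provides create_model, a documented helper for building models at run time, which avoids assembling a class namespace by hand. A minimal sketch of the same idea under that approach. build_model is a hypothetical helper and the field names in the example are made up.

```python
from typing import Any, Dict, List, Type

from pydantic import BaseModel, create_model

# Same string-to-type mapping as in create_dynamic_model above.
TYPE_MAPPING: Dict[str, type] = {"str": str, "bool": bool, "int": int, "float": float}


def build_model(fields: List[Dict[str, str]], name: str = "ExtractSchema") -> Type[BaseModel]:
    """Build a Pydantic model from [{"name": ..., "type": ...}] entries.

    Hypothetical helper; entries with an empty name are skipped.
    """
    definitions: Dict[str, Any] = {
        field["name"]: (TYPE_MAPPING[field["type"]], ...)  # `...` marks the field as required
        for field in fields
        if field["name"]
    }
    return create_model(name, **definitions)


model = build_model([
    {"name": "article_title", "type": "str"},
    {"name": "is_paid", "type": "bool"},  # illustrative field
    {"name": "", "type": "str"},  # skipped: empty name
])
schema = model.model_json_schema()
print(sorted(schema["properties"]))  # ['article_title', 'is_paid']
```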

Instantiate WebsiteScraper and Set Parameters
This cell creates an instance of the WebsiteScraper class and sets the parameters for the web scraping task.

scraper = WebsiteScraper(): Creates an object of the WebsiteScraper class.
website_url = "https://blog.dailydoseofds.com/*": Defines the URL pattern to scrape. The trailing * indicates that Firecrawl should scrape all pages under this path.

prompt = "extract publish date, title and link of all articles related to LLMs": Defines the prompt for the Firecrawl extraction, specifying what information to extract.

schema_fields = [...]: Defines an optional list of schema fields to guide the extraction. The list is defined in this cell but is not passed to scrape_website, so structured extraction with a predefined schema is not used in this execution.

result = scraper.scrape_website(website_url, prompt, []): Calls the scrape_website method to perform the extraction. Note that an empty list [] is passed for schema_fields, so no schema is used.

print("Results:\n"): Prints a heading before displaying the results.
print(result): Prints the raw output from the Firecrawl extraction.
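
An empty list disables the schema because [] is falsy in Python, so the `if schema_fields` check in scrape_website skips schema generation. A minimal standalone sketch of that branch. build_extract_params and fake_schema_builder are hypothetical stand-ins, and no Firecrawl call is made.

```python
from typing import Any, Callable, Dict, List, Optional

Fields = List[Dict[str, str]]


def build_extract_params(
    prompt: str,
    schema_fields: Optional[Fields],
    schema_builder: Callable[[Fields], Optional[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Mirror the parameter-building branch of scrape_website."""
    params: Dict[str, Any] = {"prompt": prompt}
    # An empty list and None are both falsy, so neither triggers schema generation.
    schema = schema_builder(schema_fields) if schema_fields else None
    if schema:
        params["schema"] = schema
    return params


def fake_schema_builder(fields: Fields) -> Dict[str, Any]:
    """Stand-in for create_schema_from_fields; returns a toy schema."""
    return {"type": "object", "properties": {f["name"]: {} for f in fields}}


print(build_extract_params("p", [], fake_schema_builder))
# {'prompt': 'p'}
print("schema" in build_extract_params("p", [{"name": "title", "type": "str"}], fake_schema_builder))
# True
```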

In [14]:
scraper = WebsiteScraper()

# Get user input
website_url = "https://blog.dailydoseofds.com/*"
prompt = "extract publish date, title and link of all articles related to LLMs"

# Optional: Add schema fields
schema_fields = [
    {"name": "Article_title", "type": "str"},
    {"name": "Publish_date", "type": "str"},
    {"name": "Article_link", "type": "str"}
]

# Get results
result = scraper.scrape_website(website_url, prompt, [])
print("Results:\n")
print(result)

Results:



Define Pydantic Model for Schema
This cell defines a Pydantic BaseModel called ExtractSchema. This model represents the structure of the data that is expected to be extracted from the website when using a schema.

mission: str: Defines a field named mission with a string type.
supports_sso: bool: Defines a field named supports_sso with a boolean type.
is_open_source: bool: Defines a field named is_open_source with a boolean type.
is_in_yc: bool: Defines a field named is_in_yc with a boolean type.

In [22]:
class ExtractSchema(BaseModel):
    mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

Generate JSON Schema from Pydantic Model
This cell uses the Pydantic BaseModel defined in the previous cell (ExtractSchema) to generate its corresponding JSON schema. The model_json_schema() method is called on the ExtractSchema class to produce the schema, which is then printed to the output.

This JSON schema can be used by Firecrawl to ensure that the extracted data conforms to a specific structure.

In [23]:
ExtractSchema.model_json_schema()

{'properties': {'mission': {'title': 'Mission', 'type': 'string'},
  'supports_sso': {'title': 'Supports Sso', 'type': 'boolean'},
  'is_open_source': {'title': 'Is Open Source', 'type': 'boolean'},
  'is_in_yc': {'title': 'Is In Yc', 'type': 'boolean'}},
 'required': ['mission', 'supports_sso', 'is_open_source', 'is_in_yc'],
 'title': 'ExtractSchema',
 'type': 'object'}
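
Whether Firecrawl enforces the schema on its side is not shown in this notebook. A local check is possible with the same Pydantic model: a minimal sketch that validates two made-up records (not real extraction output) and lists the fields missing from the incomplete one.

```python
from pydantic import BaseModel, ValidationError


class ExtractSchema(BaseModel):
    mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool


# Made-up sample records for illustration.
complete = {
    "mission": "Example mission",
    "supports_sso": True,
    "is_open_source": False,
    "is_in_yc": False,
}
incomplete = {"mission": "Example mission"}

record = ExtractSchema.model_validate(complete)
print(record.supports_sso)  # True

missing = []
try:
    ExtractSchema.model_validate(incomplete)
except ValidationError as exc:
    # Each error carries the location of the offending field.
    missing = sorted(str(err["loc"][0]) for err in exc.errors())
print(missing)  # ['is_in_yc', 'is_open_source', 'supports_sso']
```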

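Create Schema from schema_fields

This cell calls the create_schema_from_fields method of the WebsiteScraper class with the schema_fields list defined earlier. That list contains three named fields, so it is not empty; for an empty list, or one whose names are all blank, the method returns None rather than a schema. The recorded output nonetheless has an empty properties object and no required list. This is consistent with the dynamic model having been built with the key _annotations_ (single underscores) rather than __annotations__: Pydantic does not read that key as field annotations, so the model ends up with no fields. With __annotations__, the three fields would be expected to appear under properties.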



In [24]:
scraper.create_schema_from_fields(schema_fields)

{'properties': {}, 'title': 'ExtractSchema', 'type': 'object'}

Perform Extraction with Schema (Corrected)

This cell demonstrates how to perform a web extraction using the FirecrawlApp directly, incorporating a Pydantic schema for structured output.

app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY")): Initializes the FirecrawlApp.

class ExtractSchema(BaseModel): ...: Redefines the ExtractSchema Pydantic model with fields relevant to the desired extraction (article title, publish date, link).

data = app.extract(...): Calls the app.extract method with three keyword arguments:
urls=["https://blog.dailydoseofds.com/*"]: Provides the URL pattern in a list.
prompt='...': Provides the extraction prompt.
schema=ExtractSchema.model_json_schema(): Provides the JSON schema generated from the Pydantic model.

print(data): Prints the extracted data, which is expected to follow the defined schema.

This cell shows the corrected way to pass parameters to the app.extract method for structured extraction.

In [26]:
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field
import os

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))

class ExtractSchema(BaseModel):
    article_title: str
    publish_date: str
    article_link: str

data = app.extract(
    urls=["https://blog.dailydoseofds.com/*"], # URL pattern, passed in a list via the urls keyword
    prompt='Extract the article title, publish date, and article link of all articles related to LLMs.',
    schema=ExtractSchema.model_json_schema(), # Pass schema as a keyword argument
)
print(data)
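
The prompt asks for all matching articles, while ExtractSchema describes a single article. One possible variant, offered as an assumption rather than a documented Firecrawl requirement, is to wrap the per-article model in a list field so the schema itself expresses "many articles". Article and ArticleList are hypothetical names; the articles key matches the one the display cell later checks for in result.data.

```python
from typing import List

from pydantic import BaseModel


class Article(BaseModel):
    article_title: str
    publish_date: str
    article_link: str


class ArticleList(BaseModel):
    # One entry per extracted article.
    articles: List[Article]


list_schema = ArticleList.model_json_schema()
print(list_schema["properties"]["articles"]["type"])  # array
```

The resulting dictionary could then be tried as schema=list_schema in the app.extract call.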



Display Results in Markdown Table
This cell takes the result object obtained from the web scraping (the result variable assigned in the earlier cell that calls scraper.scrape_website) and displays the extracted data in a markdown table format using pandas.

It checks if result and result.data exist and if result.data contains an 'articles' key (which is the case when scraping multiple articles). If so, it creates a DataFrame from the list of articles.

It also includes a condition to handle cases where the result.data is a single dictionary (e.g., from scraping a single URL), creating a DataFrame with a single row in that case.

df.to_markdown(index=False): Converts the pandas DataFrame to a markdown formatted string. index=False prevents the DataFrame index from being included in the output.

print(...): Prints the markdown table to the console.
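
DataFrame.to_markdown relies on the optional tabulate package. Where that is not installed, a small pure-Python renderer can stand in. A minimal sketch, assuming the records are plain dictionaries; rows_to_markdown is a hypothetical helper, the sample rows are made up, and pipe characters inside values are not escaped.

```python
from typing import Any, Dict, List


def rows_to_markdown(rows: List[Dict[str, Any]]) -> str:
    """Render a list of dicts as a Markdown table; missing values become empty cells."""
    if not rows:
        return ""
    # Collect column names in first-seen order across all rows.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        cells = ["" if row.get(col) is None else str(row[col]) for col in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


sample = [
    {"title": "First post", "publish_date": "2024-01-01"},
    {"title": "Second post"},  # no publish_date: rendered as an empty cell
]
table = rows_to_markdown(sample)
print(table)
```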

In [27]:
if result and result.data and 'articles' in result.data:
    df = pd.DataFrame(result.data['articles'])
    print(df.to_markdown(index=False))
elif result and result.data:
    # Handle the case where the data is a single dictionary (from a single URL extraction)
    df = pd.DataFrame([result.data])
    print(df.to_markdown(index=False))
else:
    print("No data to display in markdown.")

| link                                                                                                | title                                                                           | publish_date   |
|:----------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------|:---------------|
| https://blog.dailydoseofds.com/p/5-llm-fine-tuning-techniques-explained                             | 5 LLM Fine-tuning Techniques Explained Visually                                 | 2024-05-30     |
| https://www.dailydoseofds.com/understanding-lora-derived-techniques-for-optimal-llm-fine-tuning/    | Understanding LoRA-derived Techniques for Optimal LLM Fine-tuning               |                |
| https://www.dailydoseofds.com/implementing-lora-from-scratch-for-fine-tuning-llms/                  | Implementing LoRA From Scratch for Fine-tuning LLMs                             |   