<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DMN2Z6YxB8Ku6yb84tM3MDRnSmIHqyEM#scrollTo=svSNXqChDApd)
## Master Generative AI in 6 Weeks

**What You'll Learn:**

- Build with Latest LLMs
- Create Custom AI Apps
- Learn from Industry Experts
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

*Empowering the Next Generation of AI Innovators*

## 🔥 FireCrawl: Advanced Web Scraping and Data Extraction for AI Applications
FireCrawl gives your AI apps clean data from any website through scraping, crawling, and structured data extraction.

### **Setup and Installation**



In [None]:
# python-dotenv and pydantic are imported by later cells in this notebook
%pip install firecrawl-py python-dotenv pydantic

### **Setup the API Key**


In [None]:
from google.colab import userdata
import os

# Copy the key from Colab secrets into the process environment
os.environ['FIRECRAWL_API_KEY'] = userdata.get('FIRECRAWL_API_KEY')

firecrawl_api_key = os.getenv("FIRECRAWL_API_KEY")
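
`os.getenv` returns `None` without complaint when the variable is missing, so a misconfigured key only surfaces later as a failed API call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this notebook, not part of the Firecrawl SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or blank.

    Hypothetical helper for illustration; not a Firecrawl API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to Colab secrets or export it first."
        )
    return value
```

One possible use is `firecrawl_api_key = require_env("FIRECRAWL_API_KEY")` in place of the plain `os.getenv` call.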

### **Scrape a Website**


In [None]:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=firecrawl_api_key)

# Scrape a single page and request both Markdown and HTML output
scrape_status = app.scrape_url(
  'https://www.buildfastwithai.com/',
  params={'formats': ['markdown', 'html']}
)
print(scrape_status)

{'markdown': "![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcursor_image.ed9ca23e.png&w=128&q=75)\n\nAsk to\n\n### BuildFast Bot\n\n![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcursor_image.ed9ca23e.png&w=128&q=75)\n\nHey! Wanna know about Generative AI Crash Course?\n\nWhat will I learn?How can I join?What is the course duration?What is the course fee?What is the course schedule?What is the course syllabus?\n\nSend\n\n[GenAI 2025 Launch Pad](/genai-course)\n\n# Transform AI Ideas   into Reality\n\nJoin 15,000+ professionals mastering practical AI development at Build Fast with AI\n\nLearn. Build. Deploy.\n\n[Begin Your AI Journey](/genai-course) Join Waitlist\n\nWho We Are\n\n## Accelerate Your AI Journey with Us\n\nBuild Fast with AI helps professionals and businesses rapidly implement Gen AI through practical, hands-on education and consulting. Founded by IIT Delhi alumni, we've trained 15,000+ professionals from Google, Amazon, BCG, McKinsey and more.\n\n### Corporate Traini

### **Crawl a Website**


In [None]:
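# Crawl the site starting from the home page
crawl_status = app.crawl_url(
  'https://www.buildfastwithai.com/',
  params={
    'limit': 100,  # maximum number of pages to crawl
    'scrapeOptions': {'formats': ['markdown', 'html']}
  },
  poll_interval=30  # seconds between status checks while the crawl runs
)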
print(crawl_status)
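
Printing the whole crawl result is hard to read once many pages come back. The sketch below writes each page's Markdown to its own file instead. It assumes the result is a dictionary with a `data` list whose entries carry `markdown` and `metadata["sourceURL"]`; that shape is an assumption and should be checked against the actual `crawl_status`. `sample_crawl`, `slugify`, and `save_pages` are hypothetical names used only for this illustration.

```python
import re
from pathlib import Path

# Hypothetical stand-in for `crawl_status`; the real response shape may differ.
sample_crawl = {
    "status": "completed",
    "data": [
        {"markdown": "# Home", "metadata": {"sourceURL": "https://www.buildfastwithai.com/"}},
        {"markdown": "# Events", "metadata": {"sourceURL": "https://www.buildfastwithai.com/events"}},
    ],
}


def slugify(url: str) -> str:
    """Turn a URL into a filesystem-safe name."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", slug).strip("-")
    return slug or "page"


def save_pages(crawl_result: dict, out_dir: str = "crawl_output") -> list:
    """Write each page's Markdown into `out_dir` and return the paths written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for page in crawl_result.get("data", []):
        url = page.get("metadata", {}).get("sourceURL", "")
        path = target / f"{slugify(url)}.md"
        path.write_text(page.get("markdown", ""), encoding="utf-8")
        written.append(path)
    return written


paths = save_pages(sample_crawl)
print([p.name for p in paths])
```

With a real crawl, `save_pages(crawl_status)` would be the equivalent call, provided the response matches the assumed shape.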



### **Scrape Hacker News Data**

In [None]:
import json
from firecrawl import FirecrawlApp
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from typing import List
from datetime import datetime

load_dotenv()

BASE_URL = "https://news.ycombinator.com/"


class NewsItem(BaseModel):
    title: str = Field(description="The title of the news item")
    source_url: str = Field(description="The URL of the news item")
    author: str = Field(
        description="The URL of the post author's profile concatenated with the base URL."
    )
    rank: str = Field(description="The rank of the news item")
    upvotes: str = Field(description="The number of upvotes of the news item")
    date: str = Field(description="The date of the news item.")


class NewsData(BaseModel):
    news_items: List[NewsItem]


def get_firecrawl_news_data():
    app = FirecrawlApp()

    data = app.scrape_url(
        BASE_URL,
        params={
            "formats": ["extract"],
            "extract": {"schema": NewsData.model_json_schema()},
        },
    )

    return data


def save_firecrawl_news_data():
    """
    Save the scraped news data to a JSON file with the current date in the filename.
    """
    # Get the data
    data = get_firecrawl_news_data()
    # Format current date for filename
    date_str = datetime.now().strftime("%Y_%m_%d_%H_%M")
    filename = f"firecrawl_hacker_news_data_{date_str}.json"

    # Save the news items to JSON file
    with open(filename, "w") as f:
        json.dump(data["extract"]["news_items"], f, indent=4)

    print(f"{datetime.now()}: Successfully saved the news data.")


if __name__ == "__main__":
    save_firecrawl_news_data()

2024-12-24 08:53:10.174913: Successfully saved the news data.


### **Load and Display Saved News Data**

In [None]:
import json

with open("/content/firecrawl_hacker_news_data_2024_12_24_08_53.json", "r") as f:
    print(json.load(f))

[{'title': '38th Chaos Communication Congress', 'source_url': 'https://events.ccc.de/congress/2024/infos/index.html', 'author': 'https://news.ycombinator.com/user?id=joeig', 'rank': '1', 'upvotes': '19', 'date': '36 minutes ago'}, {'title': 'The number pi has an evil twin', 'source_url': 'https://mathstodon.xyz/@johncarlosbaez/113703444230936435', 'author': 'https://news.ycombinator.com/user?id=pkaeding', 'rank': '2', 'upvotes': '143', 'date': '5 hours ago'}, {'title': 'Making AMD GPUs competitive for LLM inference (2023)', 'source_url': 'https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference', 'author': 'https://news.ycombinator.com/user?id=plasticchris', 'rank': '3', 'upvotes': '166', 'date': '8 hours ago'}, {'title': "What happened to the world's largest tube TV? [video]", 'source_url': 'https://www.youtube.com/watch?v=JfZxOuc9Qwk', 'author': 'https://news.ycombinator.com/user?id=ecliptik', 'rank': '4', 'upvotes': '454', 'date': '13 hours ago'}, {'title': 'Buil
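
Because the schema declares `rank` and `upvotes` as strings, the saved values need converting before they can be sorted or compared numerically. A minimal sketch, using a few records copied from the output above (trimmed to the fields used here); `to_int` and `top_by_upvotes` are hypothetical helpers, not part of Firecrawl.

```python
# Sample records copied from the saved output shown above (fields trimmed).
news_items = [
    {"title": "38th Chaos Communication Congress", "rank": "1", "upvotes": "19"},
    {"title": "The number pi has an evil twin", "rank": "2", "upvotes": "143"},
    {"title": "Making AMD GPUs competitive for LLM inference (2023)", "rank": "3", "upvotes": "166"},
    {"title": "What happened to the world's largest tube TV? [video]", "rank": "4", "upvotes": "454"},
]


def to_int(value, default: int = 0) -> int:
    """Parse a numeric string such as '143'; return `default` on bad input."""
    try:
        return int(str(value).replace(",", "").strip())
    except ValueError:
        return default


def top_by_upvotes(items, n: int = 3):
    """Return the `n` items with the most upvotes, highest first."""
    return sorted(items, key=lambda item: to_int(item.get("upvotes", "")), reverse=True)[:n]


for item in top_by_upvotes(news_items):
    print(f'{to_int(item["upvotes"]):>4}  {item["title"]}')
```

The same functions can be applied to the list returned by `json.load(f)`.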

### **Map a Website**

In [None]:
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=firecrawl_api_key)


map_result = app.map_url('https://www.buildfastwithai.com/')
print(map_result)


{'success': True, 'links': ['https://buildfastwithai.com', 'https://www.buildfastwithai.com/consulting', 'https://www.buildfastwithai.com/events', 'https://www.buildfastwithai.com/apps', 'https://www.buildfastwithai.com/contact', 'https://www.buildfastwithai.com/genai-course', 'https://www.buildfastwithai.com/privacy-policy', 'https://www.buildfastwithai.com/refund-policy', 'https://www.buildfastwithai.com/terms-and-conditions', 'https://www.buildfastwithai.com/events/gpt-4o-deep-dive', 'https://www.buildfastwithai.com/events/gen-ai-for-excel', 'https://www.buildfastwithai.com/events/llama-3-deep-dive', 'https://www.buildfastwithai.com/events/automate-your-travel-planning', 'https://www.buildfastwithai.com/events/master-automation-with-ai', 'https://www.buildfastwithai.com/events/function-calling-with-llms', 'https://www.buildfastwithai.com/events/web-scraping-with-gen-ai', 'https://www.buildfastwithai.com/events/10x-developer-productivity-with-ai', 'https://www.buildfastwithai.com/eve
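
The `links` list is flat, so grouping it by the first path segment makes it easier to see which parts of the site exist before deciding what to scrape. A minimal sketch using the standard library and a handful of links copied from the output above; `group_by_section` is a hypothetical helper.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A few links copied from the map output above.
links = [
    "https://buildfastwithai.com",
    "https://www.buildfastwithai.com/consulting",
    "https://www.buildfastwithai.com/events",
    "https://www.buildfastwithai.com/genai-course",
    "https://www.buildfastwithai.com/events/gpt-4o-deep-dive",
    "https://www.buildfastwithai.com/events/gen-ai-for-excel",
]


def group_by_section(urls):
    """Group URLs by the first segment of their path ('' for the site root)."""
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        groups[segments[0] if segments else ""].append(url)
    return dict(groups)


sections = group_by_section(links)
for name, urls in sections.items():
    print(f"{name or '(root)'}: {len(urls)} link(s)")
```

With the real result, `group_by_section(map_result["links"])` would be the equivalent call, since the output above shows a `links` key.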

### **Extract Specific Data from a Website**

In [None]:
from firecrawl import FirecrawlApp
from pydantic import BaseModel

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key=firecrawl_api_key)

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://www.buildfastwithai.com/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])


{'company_mission': 'Build Fast with AI helps professionals and businesses rapidly implement Gen AI through practical, hands-on education and consulting.', 'supports_sso': False, 'is_open_source': False, 'is_in_yc': False}
