<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/notebook%2Faman/notebooks/FireCrawl_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the requirements

In [None]:
%pip install -q llama-index==0.10.30 openai==1.12.0 tiktoken==0.6.0 llama-index-readers-web firecrawl-py==1.2.3

## Set the environment variables

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"
FIRECRAWL_API_KEY = "<FIRECRAWL_API_KEY>"
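If you would rather not paste keys into the notebook, one option is to read them from the environment and fail early when they are missing. The helper below is a minimal sketch; `read_api_key` is a hypothetical name, and treating values that start with `<` as unset placeholders is an assumption based on the placeholder strings used above.

```python
import os


def read_api_key(name, env=None):
    """Return the API key stored under `name`, or raise if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    # Treat empty values and "<PLACEHOLDER>" strings as "not configured".
    if not value or value.startswith("<"):
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Example usage (uncomment once the variable is exported in your shell):
# FIRECRAWL_API_KEY = read_api_key("FIRECRAWL_API_KEY")
```

The optional `env` argument only exists so the helper can be tried with a plain dictionary instead of the real environment.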

# SCRAPE WITH FIRECRAWL

## IMPORT THE FIRECRAWL WEBREADER

Firecrawl turns entire websites into LLM-ready markdown.

Get your API key here:
https://www.firecrawl.dev/app/api-keys

In [None]:
from llama_index.readers.web import FireCrawlWebReader

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:

# Use Firecrawl in "scrape" mode, which fetches a single page rather than crawling a whole site
firecrawl_reader = FireCrawlWebReader(
    api_key=FIRECRAWL_API_KEY,  # Your API key from https://www.firecrawl.dev/
    mode="scrape",
)

# Load documents from a single page URL
documents = firecrawl_reader.load_data(url="https://towardsai.net/")

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [None]:
res = query_engine.query("What is towards AI aim?")

print(res.response)

print("-----------------")
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    # Use .get() so a page without a given metadata field does not raise KeyError
    print("Title\t", src.metadata.get("title"))
    print("URL\t", src.metadata.get("sourceURL"))
    print("Score\t", src.score)
    print("Description\t", src.metadata.get("description"))
    print("-_" * 20)

Towards AI aims to make AI and machine learning accessible to all by providing courses, blogs, tutorials, books, newsletters, and a community platform.
-----------------
Node ID	 fd7ec7d6-aaf7-4350-b1fd-7bb9f256abf1
Title	 Towards AI
URL	 https://towardsai.net/
Score	 0.8927276434780216
Description	 Towards AI is an online publication, which focuses on sharing high-quality publications, news, articles, and stories on AI and technology related topics.
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 e8b70bef-9f08-45e9-bb0b-c6177f711740
Title	 Towards AI
URL	 https://towardsai.net/
Score	 0.8873490308374337
Description	 Towards AI is an online publication, which focuses on sharing high-quality publications, news, articles, and stories on AI and technology related topics.
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
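The `Score` values above measure how close each retrieved chunk's embedding is to the query embedding. Assuming the default in-memory vector store ranks by cosine similarity (an assumption worth checking for your own setup), the number can be reproduced for two toy vectors like this. `cosine_similarity` is a hypothetical helper written for illustration, not a LlamaIndex function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real embeddings have many more dimensions.
query_vec = [0.2, 0.7, 0.1]
chunk_vec = [0.25, 0.65, 0.05]
print(round(cosine_similarity(query_vec, chunk_vec), 4))
```

Values close to 1 mean the chunk points in nearly the same direction as the query; values near 0 mean the two are unrelated.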


# CRAWL A WEBSITE

## Load The CSV

The CSV lists a set of tools, along with the URL of the page we crawl to collect information about each tool.

In [None]:
import csv
import io

import requests

# Google Sheets file URL (CSV export link)
url = 'https://docs.google.com/spreadsheets/d/1gHB-aQJGt9Nl3cyOP2GorAkBI_Us2AqkYnfqrmejStc/export?format=csv'

# Send a GET request to fetch the CSV file (with a timeout so the cell cannot hang forever)
response = requests.get(url, timeout=30)

response_list = []
# Check if the request was successful
if response.status_code == 200:
    # Decode the content to a string
    content = response.content.decode('utf-8')

    # Parse through StringIO so quoted fields that contain newlines stay intact;
    # each row becomes a dictionary keyed by the header names
    csv_reader = csv.DictReader(io.StringIO(content, newline=''))
    response_list = list(csv_reader)
else:
    print(f"Failed to retrieve the file: {response.status_code}")
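To see what the parsed rows look like without hitting the network, the same parsing step can be tried on an inline string. This is a sketch: `parse_tool_rows` is a hypothetical helper, and the sample text uses only a subset of the sheet's columns, with values copied from the RocksDB row shown in this notebook.

```python
import csv
import io

# Inline stand-in for the downloaded sheet (subset of the real columns).
sample_csv = (
    "Name,URL,Description,Category\n"
    "RocksDB,https://rocksdb.org/,Persistent key-value store for fast storage environments,Database\n"
)


def parse_tool_rows(csv_text):
    """Parse CSV text into a list of dicts, skipping rows without a URL."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [row for row in reader if row.get("URL")]


rows = parse_tool_rows(sample_csv)
print(rows[0]["Name"], rows[0]["URL"])
```

Dropping rows with an empty `URL` is a design choice of this sketch; it avoids passing blank URLs to the crawler later on.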


In [None]:
import random

# Pick a random window of 2 consecutive rows (randint is inclusive on both ends)
start_index = random.randint(0, max(0, len(response_list) - 2))
website_list = response_list[start_index:start_index + 2]  # crawl 2 websites for the demo

In [None]:
import pprint
print("CSV data")
pprint.pprint(website_list)

CSV data
[{'': '',
  'Category': 'Database',
  'Description': 'Persistent key-value store for fast storage environments',
  'Is a direct URL company /tool website?': 'Yes',
  'Name': 'RocksDB',
  'Tool Type': 'Database',
  'URL': 'https://rocksdb.org/'},
 {'': '',
  'Category': 'Database',
  'Description': 'Document-oriented NoSQL database',
  'Is a direct URL company /tool website?': 'Yes',
  'Name': 'MongoDB',
  'Tool Type': 'Database',
  'URL': 'https://www.mongodb.com/lp/cloud/atlas/try4?utm_source=google&utm_campaign=search_gs_pl_evergreen_atlas_core_prosp-brand_gic-null_apac-ph_ps-all_desktop_eng_lead&utm_term=mongodb&utm_medium=cpc_paid_search&utm_ad=e&utm_ad_campaign_id=12212624359&adgroup=115749710543&cq_cmp=12212624359&gad_source=1&gclid=CjwKCAjw5Ky1BhAgEiwA5jGujmI0-QgV5DXTwtMUH6mJur8nIVAxkMMSoNHvp_519fBdvutBriWLHxoCe8AQAvD_BwE'}]
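The MongoDB URL above carries a long tail of advertising parameters (`utm_*`, `gclid`, and so on). If you prefer to crawl a cleaner address, a helper along these lines could drop them first. This is only a sketch: `strip_tracking_params` is a hypothetical name, the list of parameter names is an assumption drawn from the URL shown above rather than an exhaustive set, and a site may serve a different page once the parameters are gone.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters, based on the sample URL in this notebook.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "gad_source", "adgroup", "cq_cmp"}


def strip_tracking_params(url):
    """Return `url` with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES) and key not in TRACKING_KEYS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params("https://example.com/a?utm_source=x&id=3&gclid=abc"))
```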


## Initialize Firecrawl

In [None]:
import os
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

In [None]:
import time

# Crawl websites and handle responses
url_response = {}
crawl_per_min = 1  # Max crawl per minute

# Track crawls
crawled_websites = 0
scraped_pages = 0

for i, website_dict in enumerate(website_list):
    url = website_dict.get('URL')
    print(f"Crawling: {url}")

    try:
        response = app.crawl_url(
            url,
            params={
                'limit': 3,  # Limit pages to scrape per site.
                'scrapeOptions': {'formats': ['markdown', 'html']}
            }
        )
        crawled_websites += 1

    except Exception as exc:
        print(f"Failed to fetch {url} -> {exc}")
        continue

    # Store the scraped data and associated info in the response dict
    url_response[url] = {
        "scraped_data": response.get("data"),
        "csv_data": website_dict
    }

    # Pause to respect the crawl rate limit (the free plan allows 1 crawl per minute).
    # Skip the pause after the last website, since nothing is left to crawl.
    if i != len(website_list) - 1 and (i + 1) % crawl_per_min == 0:
        print("Pausing for 1 minute to comply with crawl limit...")
        time.sleep(60)  # Pause for 1 minute after every crawl


Crawling: https://rocksdb.org/
Pausing for 1 minute to comply with crawl limit...
Crawling: https://www.mongodb.com/lp/cloud/atlas/try4?utm_source=google&utm_campaign=search_gs_pl_evergreen_atlas_core_prosp-brand_gic-null_apac-ph_ps-all_desktop_eng_lead&utm_term=mongodb&utm_medium=cpc_paid_search&utm_ad=e&utm_ad_campaign_id=12212624359&adgroup=115749710543&cq_cmp=12212624359&gad_source=1&gclid=CjwKCAjw5Ky1BhAgEiwA5jGujmI0-QgV5DXTwtMUH6mJur8nIVAxkMMSoNHvp_519fBdvutBriWLHxoCe8AQAvD_BwE
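The pacing logic in the loop can be isolated into a small helper that waits between items but not after the final one. The sketch below is independent of Firecrawl: `run_with_pause` is a hypothetical name, and the `sleep` parameter is injectable only so the behaviour can be checked without actually waiting.

```python
import time


def run_with_pause(items, action, pause_seconds=60, sleep=time.sleep):
    """Apply `action` to each item, pausing between items but not after the last."""
    results = []
    for i, item in enumerate(items):
        results.append(action(item))
        if i < len(items) - 1:
            sleep(pause_seconds)
    return results


# Dry run: record the requested pauses instead of sleeping.
pauses = []
output = run_with_pause(["site-a", "site-b", "site-c"], str.upper, sleep=pauses.append)
print(output, pauses)
```

With three items the helper requests two pauses, so a two-site demo waits one minute in total rather than two.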


## Create LlamaIndex documents from the scraped content

In [None]:
from llama_index.core import Document
documents = []

for scraped_content in url_response.values():
    csv_data = scraped_content.get("csv_data")
    # "scraped_data" can be None when a crawl returns no pages
    scraped_results = scraped_content.get("scraped_data") or []

    # Create exactly one Document per scraped page
    for result in scraped_results:
        markdown_content = result.get("markdown")
        if not markdown_content:
            continue  # Skip pages that came back without markdown
        page_metadata = result.get("metadata") or {}
        documents.append(
            Document(
                text=markdown_content,
                metadata={
                    "title": page_metadata.get("title"),
                    "url": page_metadata.get("sourceURL"),
                    "description": csv_data.get("Description"),
                    "category": csv_data.get("Category")
                }
            )
        )
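The mapping from one crawled page to a document's text and metadata can be tried on its own, without LlamaIndex or a live crawl. The sketch below assumes each page is a dictionary with a `markdown` key and a `metadata` dictionary holding `title` and `sourceURL`, as the loop above reads them; `page_to_record` and the sample payload are hypothetical.

```python
def page_to_record(result, csv_row):
    """Flatten one crawled page plus its CSV row into text and metadata."""
    page_meta = result.get("metadata") or {}
    return {
        "text": result.get("markdown") or "",
        "metadata": {
            "title": page_meta.get("title"),
            "url": page_meta.get("sourceURL"),
            "description": csv_row.get("Description"),
            "category": csv_row.get("Category"),
        },
    }


# Hypothetical page payload that mirrors the keys used in this notebook.
sample_page = {
    "markdown": "# RocksDB",
    "metadata": {"title": "RocksDB", "sourceURL": "https://rocksdb.org/"},
}
sample_row = {
    "Description": "Persistent key-value store for fast storage environments",
    "Category": "Database",
}
record = page_to_record(sample_page, sample_row)
print(record["metadata"])
```

Carrying the CSV description and category into the metadata is what lets the retrieved nodes later report which tool and category they belong to.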


# Create The RAG Pipeline

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter

llm = OpenAI(model="gpt-4o-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=30)
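To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window chunker over a list of tokens. It is not how `SentenceSplitter` works internally (that splitter also tries to respect sentence boundaries); `chunk_tokens` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reached the end of the input
    return chunks


# Ten toy tokens, windows of 4 with an overlap of 1.
for chunk in chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the notebook's settings (512 and 30), each chunk would repeat roughly the last 30 tokens of the previous one, which helps keep context that straddles a chunk boundary.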

In [None]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [None]:
from IPython.display import Markdown, display
def display_response(response):
    display(Markdown(f"<b>{response}</b>"))

In [None]:
# Enter your query here; it should be relevant to the crawled websites
query = "I want to use key value store which is the best db?"
res = query_engine.query(query)
display_response(res)

print("-----------------")
# Show the retrieved nodes
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    # Use .get() so a missing metadata field does not raise KeyError
    print("Title\t", src.metadata.get("title"))
    print("URL\t", src.metadata.get("url"))
    print("Score\t", src.score)
    print("Description\t", src.metadata.get("description"))
    print("Category\t", src.metadata.get("category"))
    print("-_" * 20)

<b>RocksDB is a high-performance, adaptable key-value store optimized for fast storage environments. It is designed for maximum performance using a log structured database engine written in C++. RocksDB can handle a variety of workloads, from database storage engines to application data caching, making it a versatile option for different data needs.</b>

-----------------
Node ID	 d8913762-56d9-46e7-be6a-1472e8af426d
Title	 RocksDB | A persistent key-value store | RocksDB
URL	 http://rocksdb.org/
Score	 0.49236056175089676
Description	 Persistent key-value store for fast storage environments
Category	 Database
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_


<b>MongoDB Atlas and RocksDB are both popular NoSQL databases, each with its own strengths. MongoDB Atlas is a cloud-based document-oriented database that offers a fully managed service with features like secure default settings, multi-cloud support, and a document model that aligns well with application code. On the other hand, RocksDB is a persistent key-value store known for its high performance in fast storage environments, particularly excelling in terms of low write amplification and write performance. Both databases cater to different use cases and have unique advantages based on the specific requirements of the application or workload.</b>

-----------------
Node ID	 0e8a82be-ca4d-4526-b45b-7b78d2323d42
Title	 MongoDB Atlas: Cloud Document Database | MongoDB
URL	 https://www.mongodb.com/lp/cloud/atlas/try4?utm_source=google&utm_campaign=search_gs_pl_evergreen_atlas_core_prosp-brand_gic-null_apac-ph_ps-all_desktop_eng_lead&utm_term=mongodb&utm_medium=cpc_paid_search&utm_ad=e&utm_ad_campaign_id=12212624359&adgroup=115749710543&cq_cmp=12212624359&gad_source=1&gclid=CjwKCAjw5Ky1BhAgEiwA5jGujmI0-QgV5DXTwtMUH6mJur8nIVAxkMMSoNHvp_519fBdvutBriWLHxoCe8AQAvD_BwE
Score	 0.46048151159026907
Description	 Document-oriented NoSQL database
Category	 Database
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Node ID	 4f6405f8-eec6-482b-b4d8-d9cc76e436a3
Title	 Blog | RocksDB
URL	 https://rocksdb.org/blog/
Score	 0.39345216447105164
Description	 Persistent key-value store for fast storage environments
Category	 Database
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
