
### Crawl a website and store HTML to a Google Cloud Bucket

This Colab notebook crawls your website and saves the HTML files to a Google Cloud Bucket that you control.

**What is "Google Cloud Bucket"?**

A Google Cloud Bucket is a container for storing any type of data in Google [Cloud Storage](https://storage.cloud.google.com/).

**What is "Scrapy"?**

[Scrapy](http://scrapy.org) is a free and open-source web scraping framework that can be used to crawl websites and extract data. It can be customized in many ways, both in what URLs you allow to be crawled and also in what outputs you want to save.

**Why?**

Perhaps you need to consume unstructured HTML data from a bucket, for example as a data source for [Vertex AI Search and Conversation](https://cloud.google.com/vertex-ai-search-and-conversation). _NOTE: The Vertex AI Search and Conversation Website Crawler is significantly better; this is a DIY example._

<small><em>blame: alanblount@google.com</em></small>



In [29]:
#!pip install twisted
from twisted.internet import asyncioreactor
asyncioreactor.install()

### Alternatives

Here are some other easy-to-use apps that can crawl a website and save the HTML files to a Google Cloud Bucket:

**1. Screaming Frog**

[Screaming Frog](https://www.screamingfrog.co.uk/) is a desktop app that is available for both Windows and Mac. It is a popular SEO tool that can be used to crawl websites and identify technical issues, such as broken links, duplicate content, and missing meta descriptions. Screaming Frog can also be used to export the HTML files for a website to a Google Cloud Bucket.

To export the HTML files for a website to a Google Cloud Bucket using Screaming Frog, follow these steps:

1. Open Screaming Frog and enter the URL of the website that you want to crawl.
2. Click the "Spider" button to start the crawl.
3. Once the crawl is complete, click the "Export" button and select "Google Cloud Bucket" from the list of export options.
4. Enter the name of your Google Cloud Bucket and the path to the directory where you want to save the HTML files.
5. Click the "Export" button to start the export process.

**2. HTTrack Website Copier**

HTTrack Website Copier is a free and open-source cross-platform software application that recursively downloads World Wide Web sites for offline browsing. It can also be used to export the HTML files for a website to a Google Cloud Bucket.

To export the HTML files for a website to a Google Cloud Bucket using HTTrack Website Copier, follow these steps:

1. Open HTTrack Website Copier and enter the URL of the website that you want to crawl.
2. Click the "Next" button to continue.
3. On the "New Project" screen, enter a name for your project and select the "Download website" option.
4. Click the "Next" button to continue.
5. On the "Download options" screen, select the "Download HTML files only" option.
6. Click the "Next" button to continue.
7. On the "Where to download" screen, select the "Download to a Google Cloud Bucket" option.
8. Click the "Finish" button to start the crawl and export process.

**3. Scriptable Browser**

Do you need a full browser, for functionality which isn't currently supported in any of the above crawlers?

Take a look at this list https://github.com/dhamaniasad/HeadlessBrowsers

You'll have to extract `a href={url}` from each page and create a queue to crawl your own content, but you will be able to do so on any site you can access via a browser.
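
As a rough illustration of the "extract links and keep a queue" idea, here is a minimal sketch that uses only the Python standard library. The `PAGES` dictionary and the `fetch` callable are hypothetical stand-ins for whatever your headless browser returns; a real crawl would replace them with calls into the browser of your choice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl. `fetch(url)` returns an HTML string or None."""
    queue = deque([start_url])
    seen = {start_url}  # every URL ever queued, so nothing is fetched twice
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            next_url = urljoin(url, href)  # resolve relative links
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return order


# Hypothetical in-memory "site" standing in for a real browser session.
PAGES = {
    "https://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "https://example.com/a.html": '<a href="/">Home</a> <a href="/b.html">B</a>',
    "https://example.com/b.html": "<p>No links here.</p>",
}

visited = crawl("https://example.com/", PAGES.get)
print(visited)
```

Using a set for `seen` keeps the "already queued?" check cheap even on large sites.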

For easiest onboarding, check out https://www.cypress.io/


### Prerequisites

* A Google Cloud Account
* A Google Cloud Project
* A Google Cloud Bucket
* A publicly available website without dynamically loaded content
  * if you need something more sophisticated than `scrapy`, see the Alternatives section

### Let's do this

**Process Summary**

1. Authenticate to Google Cloud (must already have an account)
1. Install Dependencies
   * [Scrapy](http://scrapy.org) to crawl websites
   * [Crochet](https://pypi.org/project/crochet/) to execute the crawler in a non-blocking thread
1. Configure Scrapy Crawler (enter your URL & Bucket)
1. Execute the Crawler

In [30]:
# @title Authenticate to Google Cloud (must already have an account)
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [31]:
# @title Install Dependencies
# `uuid` is part of the Python standard library, so it does not need installing.
!pip install scrapy beautifulsoup4 google-cloud-storage crochet python-slugify -q

CRITICAL:twisted:Unhandled error in Deferred:
CRITICAL:twisted:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/crochet/_eventloop.py", line 121, in put
    err(result, "Unhandled error in EventualResult")
  File "/usr/local/lib/python3.12/dist-packages/twisted/python/log.py", line 124, in err
    _stuff = failure.Failure()
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/twisted/python/failure.py", line 288, in __init__
    raise NoCurrentExceptionError()
twisted.python.failure.NoCurrentExceptionError


In [37]:
# @title Configure Scrapy Crawler (enter your URL & Bucket)

# Use the params on the right for convenience.
website_url = 'https://help.kryptomate.com/en//index.html' # @param {type:"string"}
storage_bucket = 'chatbot-bucket-km9272025' # @param {type:"string"}
metadata_filename = 'kryptomate_metadata.jsonl' # @param {type:"string"}
remove_url_fragment_after_hash = True # @param {type:"boolean"}
remove_url_fragment_after_question = True # @param {type:"boolean"}
wait_for_seconds = 6000 # @param {type:"number"}

# What file extensions should the crawler ignore?
disallowed_extensions = [".pdf", ".css", ".txt", ".png", ".jpg", ".jpeg",
                         ".webp", ".gif", ".svg", ".ico", ".woff", ".woff2",
                         ".ttf", ".otf", ".eot", ".mp3", ".mp4", ".m4a",
                         ".m4v", ".mov", ".webm", ".mkv", ".avi", ".bit",
                         ".zip", ".tar", ".gz", ".7z", ".tor", ".rar",
                         ".js",".json", ".jsonl", ".torrent",
                         ".xml", ".xsl", ".rss", ".atom"]

# Import some basic, common modules.
from collections import deque
from google.cloud import storage
from slugify import slugify
import re
import random
import json
import uuid
from urllib.parse import urljoin

# Ensures we don't crawl the same URL twice.
completed_urls = []
# Acts as a buffer, containing lines for the metadata file.
metadata_buffer = deque()




In [38]:
# @title Helper functions to manage URLs.
def is_disallowed_extension(url):
    """Disallows URLs which end in any of the extensions in the `disallowed_extensions` list.

    Args:
        url (str): The URL to check.

    Returns:
        bool: True if the URL should be disallowed, False otherwise.
    """

    # Check if the URL ends in any of the extensions in the `disallowed_extensions` list
    for extension in disallowed_extensions:
        if url.endswith(extension):
            return True

    # If the URL does not end in any of the extensions in the `disallowed_extensions` list, then it is allowed
    return False


def is_allowed_url(url):
    """Allows only full URLs, not a disallowed file extension, and not already crawled.

    Args:
        url (str): The URL to check.

    Returns:
        bool: True if the URL should be allowed, False otherwise.
    """
    # require URL pattern
    url_pattern = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
    if not re.match(url_pattern, url):
      return False
    # disallow login & account pages
    url_pattern = ".*(login|logout|signin|signup|signout|register|account).*"
    if re.match(url_pattern, url.lower()):
      return False
    # disallow file extensions
    if is_disallowed_extension(url):
      return False
    # disallow repeat URLs
    if url in completed_urls:
      return False
    return True

def simplify_url(full_url):
    """Optionally strip the `#fragment` and `?query` parts, then trim stray
    whitespace, `?`, and `#` from both ends of the URL.

    Args:
        full_url (str): The full URL.

    Returns:
        str: The simplified URL.
    """
    if remove_url_fragment_after_hash:
      pattern = r"#.*$"
      full_url = re.sub(pattern, "", full_url)

    if remove_url_fragment_after_question:
      pattern = r"\?.*$"
      full_url = re.sub(pattern, "", full_url)

    # remove empty trailing space and ? and #
    full_url = full_url.strip()
    full_url = full_url.strip('?#')
    full_url = full_url.strip()

    return full_url
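
The helpers above clean URLs with regular expressions. As an alternative sketch (not what this notebook runs), the same cleanup can be done with `urllib.parse`, which splits a URL into its parts so the query and fragment can be dropped explicitly. The names `clean_url`, `has_skipped_extension`, and `SKIPPED_EXTENSIONS` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical, shortened extension list for illustration only.
SKIPPED_EXTENSIONS = (".pdf", ".css", ".png", ".js")


def clean_url(page_url, href, drop_query=True, drop_fragment=True):
    """Resolve href against page_url, optionally dropping ?query and #fragment."""
    parts = urlsplit(urljoin(page_url, href.strip()))
    query = "" if drop_query else parts.query
    fragment = "" if drop_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


def has_skipped_extension(url):
    """Check the URL path (not the query string), ignoring letter case."""
    return urlsplit(url).path.lower().endswith(SKIPPED_EXTENSIONS)


page = "https://example.com/docs/index.html"
cleaned = clean_url(page, "guide.html?utm=1#intro")
print(cleaned)
print(has_skipped_extension("https://example.com/logo.PNG?v=2"))
```

Checking the extension on the parsed path means a link such as `logo.PNG?v=2` is still recognized as an image even when the query string is kept.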

In [39]:
# @title Helper functions to log the metadata for each file to a .jsonl file.

def add_metadata(new_metadata_dict):
    """Adds a single line of metadata to a buffer and may flush that buffer.

    Args:
        new_metadata_dict (dict): The dict of data which will be added.

    Returns:
        None
    """
    metadata_buffer.append(json.dumps(new_metadata_dict))
    if len(metadata_buffer) >= 30:
        flush_metadata_buffer()

def flush_metadata_buffer():
    """Flushes the metadata buffer to the metadata file in Cloud Storage.

    Returns:
        None
    """
    bucket = storage.Client().get_bucket(storage_bucket)
    blob = bucket.blob(metadata_filename)
    # Read the existing JSONL file (if any) so new lines are appended to it.
    metadata = ''
    if blob.exists():
      metadata = blob.download_as_text()

    while len(metadata_buffer) > 0:
      metadata += metadata_buffer.popleft() + '\n'

    # Save the file. The contents are JSON Lines, not HTML.
    blob.upload_from_string(metadata, content_type='application/json')
    print(f'👍 metadata saved to {metadata_filename}')
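
To make the buffering behavior concrete, here is a small self-contained sketch of the same pattern with Cloud Storage replaced by an in-memory list. `stored_lines`, `add`, `flush`, and the threshold of 3 are assumptions chosen for the demo (the notebook flushes at 30 lines and writes to the bucket).

```python
import json
from collections import deque

FLUSH_THRESHOLD = 3  # the notebook uses 30; lowered here to keep the demo short

buffer = deque()
stored_lines = []  # hypothetical stand-in for the metadata file in the bucket


def flush():
    """Move every buffered line into the stand-in store, oldest first."""
    while buffer:
        stored_lines.append(buffer.popleft())


def add(record):
    """Serialize one record as a JSON line and flush when the buffer is full."""
    buffer.append(json.dumps(record))
    if len(buffer) >= FLUSH_THRESHOLD:
        flush()


for n in range(4):
    add({
        "id": f"doc-{n}",
        "content": {"mimeType": "text/html", "uri": f"gs://my-bucket/page-{n}.html"},
    })

print(len(stored_lines), "stored,", len(buffer), "still buffered")
```

After four records, three have been written and one is still waiting in the buffer, which is why the notebook calls `flush_metadata_buffer()` once more when the crawl ends.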

In [40]:
# @title Create the MySpider(scrapy.Spider) class

# Website crawler package
import scrapy
from scrapy.crawler import CrawlerRunner
from bs4 import BeautifulSoup

# Setup multi-threaded support via Twisted
from crochet import setup, wait_for, TimeoutError
setup()

class MySpider(scrapy.Spider):
    """Spider configured with the params from the Colab notebook.

    See more options for the Scrapy package here:
    https://docs.scrapy.org/en/latest/index.html
    """
    name = "my_spider"
    start_urls = [website_url]
    allowed_domains = [url.split('/')[2] for url in start_urls]

    custom_settings = {
        "REQUEST_FINGERPRINTER_IMPLEMENTATION": "2.7",
        "DOWNLOAD_TIMEOUT": 60,
    }

    def get_tag_text(self, soup, tag):
        """Returns the text from the given BeautifulSoup object."""
        element = soup.find(tag)
        return element.text if element else None

    def get_meta_tag_with_name(self, soup, name):
        """Returns the `content` attribute of the <meta> tag with the given name."""
        meta_tag = soup.find('meta', {'name': name})
        # .get() avoids a KeyError when the tag has no `content` attribute.
        return meta_tag.get('content') if meta_tag else None

    def parse(self, response):
        # Extract the full URL of the webpage
        full_url = response.url

        # Create a filename we will save the HTML to
        filename = re.sub(r'https?[/:]+', '-', full_url)
        filename = f'{slugify(filename)}.html'
        html_string = response.body

        # Store file contents.
        bucket = storage.Client().get_bucket(storage_bucket)
        blob = bucket.blob(filename)
        blob.upload_from_string(html_string, content_type='text/html')
        print(filename)

        # Store metadata about the file.
        # https://cloud.google.com/generative-ai-app-builder/docs/prepare-data
        # https://cloud.google.com/generative-ai-app-builder/docs/provide-schema
        # Here, we extract some data from the HTML and fill the rest with random mock values.
        soup = BeautifulSoup(html_string, 'html.parser')
        image_uri = image_name = None
        element = soup.find('img')
        # Use .get() so an <img> without src or alt does not raise a KeyError.
        if element and element.get('src'):
          image_uri = urljoin(full_url, element['src'])
          # The alt text is a plain label, not a URL, so store it as-is.
          image_name = element.get('alt')
        # Some very common metadata schemes are baked in and automatic.
        tags = ["mock", "unknown", "missing", "todo"]
        categories = ["API Reference", "Blog", "Documentation"]
        # You can add your own tags as well.
        novel_doc_status = ["New", "Deprecated", "Stable"]
        add_metadata({
            "id": str(uuid.uuid4()),
            "structData": {
                "url": full_url,
                "title": self.get_tag_text(soup, 'title') or self.get_meta_tag_with_name(soup, 'title') or self.get_tag_text(soup, 'h1'),
                "keywords": self.get_meta_tag_with_name(soup, 'keywords'),
                "description": self.get_meta_tag_with_name(soup, 'description') or self.get_meta_tag_with_name(soup, 'desc'),
                "image": {
                    "image_uri": image_uri,
                    "image_name": image_name,
                },
                "tags": random.choice(tags),
                "category": random.choice(categories),
                "novel_doc_status": random.choice(novel_doc_status),
                # image, image_name, language_code, geolocation, question, answer, embedding_vector
            },
            "content": {
                "mimeType": "text/html",
                "uri": f"gs://{storage_bucket}/{filename}"
        }})

        # Recursively follow hrefs on the page
        for href in response.css("a::attr(href)").getall():
            next_url = urljoin(full_url, simplify_url(href))
            if is_allowed_url(next_url):
                completed_urls.append(next_url)
                yield response.follow(next_url, callback=self.parse)


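class CrawlerManager:
    def __init__(self):
        # Set by start_crawler(); defined here so stop_crawler() is safe to call first.
        self.crawler = None
        self.d = None

    # Blocks until the crawl finishes or `wait_for_seconds` elapses (raises TimeoutError).
    @wait_for(float(wait_for_seconds))
    def start_crawler(self):
        """Run spider with MySpider"""
        self.crawler = CrawlerRunner()
        self.d = self.crawler.crawl(MySpider)
        return self.d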

    def stop_crawler(self):
        """Stop the crawler."""
        if self.crawler:
            self.crawler.stop()
            self.crawler = None
        flush_metadata_buffer()
        print("Spider Closed, cleanup complete.")
        return None
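
The spider derives `allowed_domains` with `url.split('/')[2]`, which works for simple URLs but keeps any port number in the result. As a hedged alternative sketch, `urllib.parse.urlsplit` exposes the bare host name directly. The helper names `domain_of` and `is_same_site` are hypothetical and are not part of Scrapy.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL, without port or credentials."""
    return urlsplit(url).hostname


def is_same_site(url, start_url):
    """True when both URLs point at the same host."""
    return domain_of(url) == domain_of(start_url)


start = "https://help.example.com:8443/en/index.html"

print(start.split('/')[2])  # string splitting keeps the port
print(domain_of(start))     # urlsplit returns only the host name
print(is_same_site("https://help.example.com/faq", start))
```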



In [41]:
# @title Execute the Crawler
import traceback
def execute_the_crawler():
    crawler_manager = CrawlerManager()
    try:
        print('Starting...')
        crawler_manager.start_crawler()
        flush_metadata_buffer()
        return "Done!"
    except KeyboardInterrupt as e:
        print('KeyboardInterrupt')
        flush_metadata_buffer()
        return "Stopped the crawler due to KeyboardInterrupt (you clicked stop)."
    except TimeoutError as e:
        print('TimeoutError')
        flush_metadata_buffer()
        return "Stopped the crawler due to timeout."
    except Exception as e:
        print(f"Stopped the crawler due to an unexpected error: {type(e).__name__}")
        print(f"Error details: {e}")
        print("Full traceback:")
        traceback.print_exc()
        return "Stopped the crawler due to some other error."


execute_the_crawler()


Starting...
KeyboardInterrupt
👍 metadata saved to kryptomate_metadata.jsonl


'Stopped the crawler due to KeyboardInterrupt (you clicked stop).'