# Designing a Web Crawler

Let's design a Web Crawler that will browse and download the World Wide Web. 

## What's a Web Crawler?
It's a software program which browses the WWW in a methodical and automated manner, collecting documents by recursively fetching links from a set of starting pages.

Search engines use web crawling as a means to provide up-to-date data. Search engines download all pages and create an index on them to perform faster searches.

Other uses of web crawlers:
- To test web pages and links for valid syntax and structure.
- To search for copyright infringements.
- To maintain mirror sites for popular web sites.
- To monitor sites and detect when their content changes.

## 1. Requirements and Goals of the System
**Scalability:** Our service needs to be scalable, since we'll be fetching hundreds of millions of web documents.

**Extensibility:** Our service should be designed in a way that allows new functionality to be added to it. It should be able to support new document formats that need to be downloaded and processed in the future.

## 2. Design Considerations
We should be asking a few questions here:

#### Is it a crawler for only HTML pages? Or should we fetch and store other media types like images, videos, etc.?
It's important to clarify this because it will change the design. If we're writing a general-purpose crawler, we might want to break the parsing module down into a set of modules, one per media type: one for HTML, another for images, another for videos, and so on.

For this design, let's assume our web crawler will deal with HTML only.

#### What protocols apart from HTTP are we looking at? FTP?
Let's assume HTTP for now. Again, it should not be hard to extend it to other protocols.

#### What is the expected number of pages we will crawl? How big will be the URL Database?
Let's assume we need to crawl 1 billion websites. Since one site can contain many URLs, assume an upper bound of `15 billion web pages`.


#### Robots Exclusion Protocol?
Some web crawlers implement the Robots Exclusion Protocol, which allows webmasters to declare parts of their sites off limits to crawlers. The Robots Exclusion Protocol requires a web crawler to fetch a document called `robots.txt`, which contains these declarations for that site, before downloading any real content from it.
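As a rough illustration of the check described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The `robots.txt` body, the `MyCrawler` user agent, and the `example.com` URLs are made up for the example; a real crawler would download the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would fetch this from the
# site's /robots.txt before requesting any other page on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_robots_parser(robots_body: str) -> RobotFileParser:
    """Parse an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser

parser = build_robots_parser(ROBOTS_TXT)

# "MyCrawler" is an assumed user-agent name for this sketch.
allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")
```

Here `allowed` is `True` and `blocked` is `False`, so the crawler would skip anything under `/private/`.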


## 3. Capacity Estimation and Constraints
If we crawl 15B pages in 4 weeks, how many pages will we need to fetch per second?

```text
        15B / (4 weeks * 7 days * 86400 sec) ~= 6200 web pages/sec
```

**What about storage?** Page sizes vary. But since we are dealing with HTML only, let's assume an average page size of 100KB. If we also store 500 bytes of metadata with each page, the total storage we would need is:

```text
        15B * (100KB + 500 bytes)
        15B * 100.5KB ~= 1.5 Petabytes
```

We don't want to go beyond 70% capacity of our storage system, so the total storage we will need is:

```text
        1.5 petabytes / 0.7 ==> 2.14 Petabytes     
```
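The estimates in this section can be reproduced with a few lines of arithmetic. This is only a sketch: the constant names are invented here, and it assumes decimal units (1KB = 1,000 bytes, 1PB = 10^15 bytes).

```python
# All inputs come from the assumptions stated in this section.
PAGES = 15_000_000_000            # 15B pages
CRAWL_SECONDS = 4 * 7 * 86_400    # 4 weeks
PAGE_BYTES = 100 * 1_000          # 100KB average page (decimal KB assumed)
METADATA_BYTES = 500              # metadata stored with each page
MAX_UTILIZATION = 0.7             # stay at or below 70% of capacity

pages_per_second = PAGES / CRAWL_SECONDS
raw_petabytes = PAGES * (PAGE_BYTES + METADATA_BYTES) / 1e15
provisioned_petabytes = raw_petabytes / MAX_UTILIZATION

print(f"{pages_per_second:.0f} pages/sec")
print(f"{raw_petabytes:.2f} PB raw, {provisioned_petabytes:.2f} PB provisioned")
```

Without rounding the intermediate result, the raw figure is about 1.51 PB and the provisioned figure about 2.15 PB, which agrees with the rounded numbers above.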

## 4. High Level Design
The basic algorithm of a web crawler is this:

1. Taking in a list of seed URLs as input, pick a URL from the unvisited URL list.
2. Find the URL host-name's IP address.
3. Establish a connection to the host to download its corresponding documents.
4. Parse the document's contents to look for new URLs.
5. Add the new URLs to the list of unvisited URLs.
6. Process the downloaded document, e.g., store it or index its contents.
7. Go back to step 1.
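The loop above can be sketched in a few lines. This is a simplified, single-threaded illustration, not a production design: `crawl`, `fetch_links`, and `FAKE_WEB` are hypothetical names, and `fetch_links` stands in for DNS resolution, download, and parsing (steps 2 to 4) so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List

def crawl(seed_urls: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_pages: int = 1000) -> List[str]:
    """Run the basic crawl loop; return URLs in the order they were processed."""
    unvisited = deque(seed_urls)       # the unvisited URL list
    seen = set(unvisited)              # every URL ever queued, to avoid re-queueing
    processed: List[str] = []
    while unvisited and len(processed) < max_pages:
        url = unvisited.popleft()      # step 1: pick an unvisited URL
        links = fetch_links(url)       # steps 2-4: resolve, download, parse
        for link in links:             # step 5: queue newly discovered URLs
            if link not in seen:
                seen.add(link)
                unvisited.append(link)
        processed.append(url)          # step 6: stand-in for storing/indexing
    return processed                   # the while loop itself is step 7

# A tiny in-memory "web" (made up) mapping each page to the links it contains.
FAKE_WEB = {
    "http://a.test/": ["http://a.test/1", "http://b.test/"],
    "http://a.test/1": ["http://a.test/"],
    "http://b.test/": [],
}

crawl_order = crawl(["http://a.test/"], lambda url: FAKE_WEB.get(url, []))
```

The `seen` set is what keeps the loop from revisiting `http://a.test/` when `http://a.test/1` links back to it.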

### How to Crawl

Breadth-first or depth-first?
Breadth-first search (BFS) is usually used. We can also use depth-first search (DFS), especially when the crawler has already established a connection with a website. In this situation, the crawler can DFS through all the URLs within that website to save some handshaking overhead.
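The difference between the two strategies comes down to which end of the unvisited list the next URL is taken from. The sketch below assumes a small made-up link graph (`LINKS`) and a hypothetical `traverse` helper; it only illustrates visit order, not connection reuse.

```python
from collections import deque
from typing import List

# Made-up link graph: each page maps to the pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}

def traverse(start: str, breadth_first: bool) -> List[str]:
    """Visit pages reachable from start in BFS or DFS order."""
    frontier = deque([start])
    seen = {start}
    order: List[str] = []
    while frontier:
        # BFS takes the oldest queued page; DFS takes the newest one.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

bfs_order = traverse("/", breadth_first=True)
dfs_order = traverse("/", breadth_first=False)
```

With BFS the crawler finishes each level (`/a`, `/b`) before going deeper; with DFS it follows one branch (`/b`, `/b/1`) to the end before returning to the other.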

**Path-ascending crawling:** Path-ascending crawling helps discover hidden or isolated resources. In this scheme, a crawler ascends to every parent path of each URL, like so:
```text
    given a seed URL of http://xyz.com/a/b/one.html

    it will attempt to crawl /a/b/, /a/ and /
```
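One possible way to generate the ancestor paths is shown below, using the standard `urllib.parse` module. The function name `ascending_paths` is invented for this sketch, and it deliberately drops any query string or fragment.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit

def ascending_paths(url: str) -> List[str]:
    """Return the parent directories of url, from deepest up to the root."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    ancestors: List[str] = []
    # Drop one trailing segment at a time: /a/b/one.html -> /a/b/ -> /a/ -> /
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return ancestors

parents = ascending_paths("http://xyz.com/a/b/one.html")
# parents == ["http://xyz.com/a/b/", "http://xyz.com/a/", "http://xyz.com/"]
```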

### Difficulties in implementing an efficient web crawler
#### 1. Large volume of web pages
A large volume implies that the web crawler can only download a fraction of the web pages at any given time, so it's critical that the crawler be intelligent enough to prioritize its downloads.

#### 2. Rate of change on web pages
Web pages change frequently. By the time the crawler downloads the last page from a site, pages it fetched earlier may have changed, or new pages may have been added.

**Components of a bare minimum crawler:**
1. **URL frontier:** stores the list of URLs to download and prioritizes which URLs should be crawled first.
2. **HTTP Fetcher:** to retrieve a web page from the host's server.
3. **Extractor:** to extract links from HTML documents.
4. **Duplicate Remover:** to make sure the same content is not extracted twice.
5. **Datastore:** to store retrieved pages, URLs and other metadata.

![](images/designing_webcrawler_high_level.png)
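To make the URL frontier component more concrete, here is a minimal in-memory sketch built on a heap. The `URLFrontier` class, its method names, and the numeric priority (lower means crawl sooner) are assumptions for illustration; how priorities are actually assigned is not specified in this design.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple

class URLFrontier:
    """Priority queue of URLs to download; lower priority value is served first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order
        self._known: Set[str] = set()       # URLs that were ever added

    def add(self, url: str, priority: int) -> bool:
        """Queue url unless it was added before; return True if it was queued."""
        if url in self._known:
            return False
        self._known.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def next_url(self) -> Optional[str]:
        """Pop the highest-priority URL, or return None when the frontier is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)

frontier = URLFrontier()
frontier.add("http://example.com/news", priority=1)
frontier.add("http://example.com/archive", priority=5)
frontier.add("http://example.com/", priority=0)
duplicate_added = frontier.add("http://example.com/news", priority=0)  # already known

served = [frontier.next_url() for _ in range(3)]
```

A frontier for billions of URLs would not fit in one process's memory, so this only shows the interface the other components would rely on.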

## 5. Detailed Component Design
Assume the crawler is running on a single server, where multiple worker threads each perform, in a loop, all the steps needed to download and process a document.

**Step 1:** Remove an absolute URL from the shared URL frontier for downloading. The URL begins with a scheme (e.g., HTTP) which identifies the network protocol that should be used to download it.
We can implement these protocols in a modular way for extensibility, so that later if our crawler needs to support more protocols, it can easily be done.

**Step 2:** Based on the URL's scheme, the worker calls the appropriate protocol module to download the document.
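One way to keep protocols modular is a registry keyed by URL scheme, as in the sketch below. The names `PROTOCOL_MODULES`, `register_protocol`, `fetch_http`, and `download` are hypothetical, and the HTTP module is a stub that returns canned bytes instead of making a request.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Registry of protocol modules: scheme -> function that downloads a URL.
PROTOCOL_MODULES: Dict[str, Callable[[str], bytes]] = {}

def register_protocol(scheme: str):
    """Decorator that registers a download function for one URL scheme."""
    def decorator(fetch: Callable[[str], bytes]) -> Callable[[str], bytes]:
        PROTOCOL_MODULES[scheme] = fetch
        return fetch
    return decorator

@register_protocol("http")
def fetch_http(url: str) -> bytes:
    # Stub for the sketch: a real module would issue the HTTP request here.
    return b"<html>stub for " + url.encode("utf-8") + b"</html>"

def download(url: str) -> bytes:
    """Dispatch to the protocol module that matches the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    try:
        module = PROTOCOL_MODULES[scheme]
    except KeyError:
        raise ValueError(f"no protocol module registered for scheme {scheme!r}")
    return module(url)

body = download("http://example.com/")
```

Supporting FTP later would then mean registering one more function under `"ftp"`, without touching `download`.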

**Step 3:** After downloading, the document is written into a Document Input Stream (DIS). This will enable other modules to re-read the document multiple times.
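A very small sketch of the idea follows, assuming the whole document fits in memory. The class name `DocumentInputStream` and its `open` method are invented here; a large document would probably need to be spilled to a temporary file, which this sketch does not do.

```python
import io

class DocumentInputStream:
    """Holds one downloaded document and hands out independent readers."""

    def __init__(self, content: bytes) -> None:
        self._content = content

    def open(self) -> io.BytesIO:
        """Return a fresh stream positioned at the start of the document."""
        return io.BytesIO(self._content)

dis = DocumentInputStream(b"<html><body>hello</body></html>")

# Two modules (e.g. the dedupe test and the link extractor) can each
# read the full document without affecting the other's position.
first_pass = dis.open().read()
second_pass = dis.open().read()
```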

**Step 4:** The worker invokes the dedupe test to see whether this document (associated with a different URL) has already been seen before. If so, the document is not processed any further and the worker thread removes the next URL from the frontier.
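A simple way to sketch the dedupe test is to keep a set of content checksums. The text does not say which fingerprint is used, so the SHA-256 digest and the `ContentSeenTest` class below are assumptions; note that this approach only catches byte-for-byte identical documents.

```python
import hashlib
from typing import Set

class ContentSeenTest:
    """Remembers a checksum of every document processed so far."""

    def __init__(self) -> None:
        self._checksums: Set[bytes] = set()

    def is_duplicate(self, document: bytes) -> bool:
        """Return True if identical content was seen before; otherwise record it."""
        digest = hashlib.sha256(document).digest()
        if digest in self._checksums:
            return True
        self._checksums.add(digest)
        return False

dedupe = ContentSeenTest()
page = b"<html>same content</html>"

first_seen = dedupe.is_duplicate(page)    # content reached via one URL
second_seen = dedupe.is_duplicate(page)   # same content under another URL
other_seen = dedupe.is_duplicate(b"<html>different</html>")
```

Only the second call reports a duplicate, so the worker would stop processing that document and move on to the next URL.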

**Step 5:** Process the downloaded document. Each document has a MIME type, such as HTML page, image, or video. We can implement the processing for these MIME types in a modular way, to allow for extensibility when our crawler needs to support more types. The worker invokes the process method of each processing module registered for the document's MIME type.
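The dispatch by MIME type could look roughly like the following. `PROCESSORS`, `register_processor`, `process_document`, and the two toy processors are hypothetical names chosen for this sketch; real modules would store or index the document instead of counting things.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Processor = Callable[[bytes], object]

# MIME type -> processing modules registered for that type.
PROCESSORS: DefaultDict[str, List[Processor]] = defaultdict(list)

def register_processor(mime_type: str, processor: Processor) -> None:
    PROCESSORS[mime_type].append(processor)

def process_document(mime_type: str, document: bytes) -> List[object]:
    """Invoke every module registered for mime_type and collect the results."""
    return [processor(document) for processor in PROCESSORS.get(mime_type, [])]

# Two toy processing modules for HTML.
def count_bytes(document: bytes) -> int:
    return len(document)

def count_anchor_tags(document: bytes) -> int:
    return document.lower().count(b"<a ")

register_processor("text/html", count_bytes)
register_processor("text/html", count_anchor_tags)

html_results = process_document("text/html", b'<a href="/x">x</a>')
image_results = process_document("image/png", b"\x89PNG")  # nothing registered
```

Documents whose MIME type has no registered module are simply skipped, which matches the HTML-only assumption made earlier.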

**Step 6:** The HTML processing module will extract links from the page. Each link is converted into an absolute URL and tested against a user-supplied filter to determine if it should be downloaded. If the URL passes the filter, the worker performs the URL-dedupe test, which checks if the URL has been seen before. If it's new, it is added to the URL frontier.
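Step 6 can be sketched with the standard library's `html.parser` and `urllib.parse`. The helper names (`LinkParser`, `extract_new_urls`), the sample HTML, and the HTTP-only filter are assumptions made for this example; stripping the `#fragment` before the dedupe test is an extra normalization choice made here, not something the steps above require.

```python
from html.parser import HTMLParser
from typing import Callable, List, Set
from urllib.parse import urldefrag, urljoin

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_new_urls(base_url: str, html: str,
                     url_filter: Callable[[str], bool],
                     seen_urls: Set[str]) -> List[str]:
    """Return absolute, filtered, not-yet-seen URLs found in html."""
    parser = LinkParser()
    parser.feed(html)
    new_urls: List[str] = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if not url_filter(absolute):      # user-supplied filter
            continue
        if absolute in seen_urls:         # URL-dedupe test
            continue
        seen_urls.add(absolute)
        new_urls.append(absolute)         # caller adds these to the frontier
    return new_urls

PAGE = ('<a href="/about">About</a> <a href="news.html#top">News</a> '
        '<a href="mailto:x@example.com">Mail</a> <a href="/about">Again</a>')
seen = {"http://example.com/docs/index.html"}

found = extract_new_urls("http://example.com/docs/index.html", PAGE,
                         lambda url: url.startswith("http://"), seen)
```

The `mailto:` link is rejected by the filter and the repeated `/about` link by the dedupe test, leaving two new URLs for the frontier.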