# Designing a Web Crawler

Let's design a Web Crawler that will browse and download the World Wide Web. 

## What's a Web Crawler?
It's a software program which browses the WWW in a methodical and automated manner, collecting documents by recursively fetching links from a set of starting pages.
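
The recursive link-following described above can be sketched as a breadth-first traversal. This is a minimal illustration rather than a full crawler: `crawl`, `fetch_links`, and the in-memory `TOY_WEB` graph are hypothetical names standing in for real HTTP fetching and link extraction.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_pages: int = 1000) -> List[str]:
    """Visit pages breadth-first from the seed URLs; return them in visit order."""
    frontier = deque(dict.fromkeys(seeds))  # URLs waiting to be fetched, seeds deduplicated
    seen = set(frontier)                    # every URL ever queued, so none is fetched twice
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):       # hypothetical: download the page, extract its links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# ASSUMPTION: a toy link graph replaces the real Web so the sketch runs offline.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/about", "b.example/news"],
    "b.example/news": [],
}

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the traversal from looping when pages link back to each other.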

Search engines use web crawling to keep their data up to date. The crawler downloads pages, and the search engine builds an index over them so that searches run faster.

Other uses of web crawlers:
- Test web pages and links for valid syntax and structure.
- Search for copyright infringements.
- Maintain mirror sites for popular websites.
- Monitor sites to see when their content changes.

## 1. Requirements and Goals of the System
**Scalability:** Our service needs to be scalable, since we'll be fetching billions of web documents.

**Extensibility:** Our service should be designed so that new functionality can be added to it. In particular, it should be able to support new document formats that need to be downloaded and processed in the future.

## 2. Design Considerations
We should be asking a few questions here:

#### Is it a crawler for only HTML pages? Or should we fetch and store other media types like images, videos, etc.?
It's important to clarify this because the answer changes the design. If we're writing a general-purpose crawler, we might want to split the parsing module into a set of modules, one per media type: one for HTML, another for images, another for videos, and so on.

For this design, let's assume our web crawler will deal with HTML only.
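
Even with an HTML-only crawler, the one-module-per-media-type idea can be kept open with a small parser registry. The sketch below is one possible shape, not a prescribed design: `register`, `PARSERS`, and `extract_links` are hypothetical names, and only a `text/html` parser is registered.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List

Parser = Callable[[bytes], List[str]]
PARSERS: Dict[str, Parser] = {}  # media type -> function that extracts links


def register(media_type: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under the given media type."""
    def decorator(fn: Parser) -> Parser:
        PARSERS[media_type] = fn
        return fn
    return decorator


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""
    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


@register("text/html")
def parse_html(body: bytes) -> List[str]:
    collector = _LinkCollector()
    collector.feed(body.decode("utf-8", errors="replace"))
    return collector.links


def extract_links(media_type: str, body: bytes) -> List[str]:
    """Dispatch to the parser for this media type; unknown types yield no links."""
    parser = PARSERS.get(media_type)
    return parser(body) if parser else []


print(extract_links("text/html", b'<p><a href="/about">About</a></p>'))
print(extract_links("video/mp4", b""))
```

Supporting another media type later would mean registering one more function, leaving the dispatch code untouched.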

#### What protocols apart from HTTP are we looking at? FTP?
Let's assume HTTP for now. As with media types, the design should make it easy to add other protocols later.

#### What is the expected number of pages we will crawl? How big will the URL database be?
Let's assume we need to crawl 1 billion websites. Since one site can contain many URLs, assume an upper bound of `15 billion web pages`.
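
To get a feel for what 15 billion pages implies, a back-of-the-envelope throughput calculation helps. The sketch below assumes, purely for illustration, that the whole crawl must finish within four weeks. That time window is not stated in this document, and `required_pages_per_second` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages: int, crawl_days: int) -> float:
    """Average fetch rate needed to crawl total_pages within crawl_days."""
    return total_pages / (crawl_days * SECONDS_PER_DAY)


TOTAL_PAGES = 15_000_000_000  # upper bound assumed in the text above
CRAWL_DAYS = 28               # ASSUMPTION: four-week crawl window, chosen for illustration

rate = required_pages_per_second(TOTAL_PAGES, CRAWL_DAYS)
print(f"{rate:,.0f} pages/second")
```

Under that assumed window the crawler would need to sustain roughly 6,200 page fetches per second on average. A shorter or longer window scales the figure proportionally.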


#### Robots Exclusion Protocol?
Some web crawlers implement the Robots Exclusion Protocol, which allows webmasters to declare parts of their sites off limits to crawlers. The protocol requires a web crawler to fetch a document called `robots.txt`, which contains these declarations for the site, before downloading any real content from it.
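
As a rough illustration of such a check, Python's standard library ships `urllib.robotparser`. The sketch below parses an inline sample file instead of fetching one over the network; the rules in `SAMPLE_ROBOTS_TXT`, the `ExampleCrawler` user agent, and the `build_policy` helper are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: sample rules for illustration; a real crawler fetches /robots.txt per site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content into a policy object that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(SAMPLE_ROBOTS_TXT)
allowed = policy.can_fetch(USER_AGENT, "https://example.com/index.html")
blocked = policy.can_fetch(USER_AGENT, "https://example.com/private/report.html")
print(allowed, blocked)
```

Against a live site, the same class can download the file itself through `set_url()` followed by `read()`. Caching the parsed policy per host avoids refetching `robots.txt` before every page.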
