## Overview
A web crawler begins by collecting a set of webpages and then follows the links on those pages to collect new pages. It is used for the following purposes:
- search engine indexing, as in the case of Google, Bing, etc.
- archiving webpages, as in the case of Archive.org
- web mining to collect useful information from the web

## Calculations
Let's say the crawler has to collect 1 billion pages a month. This means that it has to collect $\frac{1,000,000,000}{30 \times 24 \times 3600} \approx 386$ pages a second, or roughly $400$. If the peak QPS is double the average, it would be about $800$ pages a second.

Assuming that we need to store the webpages and that the average webpage is $500 KB$ in size, we need $\frac{1,000,000,000 \times 500}{1024 \times 1024 \times 1024} \approx 465 TB$ of storage per month. Over 5 years this amounts to $465 \times 12 \times 5 = 27,900 TB$, which is roughly $27 PB$, i.e. on the order of $30 PB$.
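The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and the inputs (30-day month, 500 KB per page, peak equal to twice the average) are the same assumptions stated in the text.

```python
# Inputs taken from the estimates in this section.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month
AVG_PAGE_KB = 500

# Throughput: average and peak pages per second.
avg_qps = PAGES_PER_MONTH / SECONDS_PER_MONTH  # about 385.8
peak_qps = 2 * avg_qps                         # about 771.6

# Storage: KB -> TB needs three divisions by 1024 (KB -> MB -> GB -> TB).
storage_tb_per_month = PAGES_PER_MONTH * AVG_PAGE_KB / 1024**3  # about 465.7
storage_pb_5_years = storage_tb_per_month * 12 * 5 / 1024       # about 27.3

print(f"average QPS: {avg_qps:.1f}, peak QPS: {peak_qps:.1f}")
print(f"storage: {storage_tb_per_month:.1f} TB/month, "
      f"{storage_pb_5_years:.1f} PB over 5 years")
```

The rounded figures of 400 and 800 pages a second and about 30 PB are the convenient values to carry into the rest of the design.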

## Architecture
![Crawler Architecture](./images/crawler_architecture.png)

### Seed URL
A good seed URL serves as a starting point from which the crawler can traverse as many links as possible. There are several approaches to selecting a good set of seed URLs:
- categorise links geographically
- categorise links by topics, etc

### URL Frontier
Maintains the list of URLs to crawl. A simple implementation would be a FIFO queue; however, it does not satisfy some properties expected of a frontier.

**Politeness:** Most hyperlinks on the web are “relative” (i.e., they refer to another page on the same web server). A frontier realized as a FIFO queue therefore contains long runs of URLs referring to pages on the same web server, so the crawler issues many consecutive HTTP requests to that server, which can amount to an unintentional denial-of-service attack.

One way to avoid this is by maintaining a separate queue per crawler (downloader) thread (or machine). The domain part of the URL is hashed, and the hash selects the queue in which the URL is stored. This way only one queue contains all the URLs of a domain, thereby slowing down requests to that domain.

![Politeness](./images/politeness.png)
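A minimal sketch of the hash-based routing described above, assuming four downloader queues. `NUM_QUEUES`, `queue_index`, and `enqueue` are hypothetical names chosen for illustration. A stable hash from `hashlib` is used instead of Python's built-in `hash()`, because the latter is randomized per process for strings and would route the same host differently across restarts.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_QUEUES = 4  # assumption: one queue per downloader thread

queues = [deque() for _ in range(NUM_QUEUES)]


def queue_index(url: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a URL to a queue based only on its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a queue.
    return int.from_bytes(digest[:8], "big") % num_queues


def enqueue(url: str) -> None:
    """Store the URL in the single queue that owns its host."""
    queues[queue_index(url)].append(url)
```

Because every URL of a host lands in the same queue, the thread draining that queue can insert a delay between requests without coordinating with other threads.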

**Priority:** A good web crawler associates a priority with each web page based on the page's usefulness. A higher priority means the page is more likely to be downloaded. There are multiple ways to determine priority: page traffic, rate of change, *PageRank*, etc.
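One simple way to realize prioritization is a heap-based queue. The sketch below is a simplification: it always returns the highest-priority URL first, whereas the text only says that higher-priority pages are *more likely* to be downloaded. `PriorityFrontier` and the numeric scores are illustrative assumptions; how a score is derived (traffic, rate of change, PageRank) is out of scope here.

```python
import heapq
import itertools


class PriorityFrontier:
    """Frontier that hands out the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: breaks ties so equal priorities stay FIFO.
        self._counter = itertools.count()

    def push(self, url: str, priority: float) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```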

### Storage for Frontier
The best place to store frontier data would be memory; however, it would soon fill up. The data can be stored on disk instead, but disk is too slow to access on every operation. A hybrid approach is therefore taken: the majority of URLs are stored on disk, and a small buffer is maintained in memory for enqueue/dequeue operations.
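The hybrid approach might look roughly like the sketch below, which keeps two small in-memory buffers and spills everything else to a plain text file. `HybridFrontier`, the one-URL-per-line file format, and the buffer size are assumptions made for illustration; a production crawler would also need compaction, crash recovery, and concurrency control.

```python
from collections import deque


class HybridFrontier:
    """FIFO queue: small in-memory buffers backed by an append-only file."""

    def __init__(self, path: str, buffer_size: int = 1000):
        self._path = path
        self._buffer_size = buffer_size
        self._write_buf = []       # URLs waiting to be flushed to disk
        self._read_buf = deque()   # URLs already loaded from disk
        self._read_offset = 0      # file position of the next unread URL
        open(path, "a", encoding="utf-8").close()  # make sure the file exists

    def _flush(self) -> None:
        with open(self._path, "a", encoding="utf-8") as f:
            f.writelines(url + "\n" for url in self._write_buf)
        self._write_buf.clear()

    def _refill(self) -> None:
        with open(self._path, "r", encoding="utf-8") as f:
            f.seek(self._read_offset)
            for _ in range(self._buffer_size):
                line = f.readline()
                if not line:
                    break
                self._read_buf.append(line.rstrip("\n"))
            self._read_offset = f.tell()

    def enqueue(self, url: str) -> None:
        self._write_buf.append(url)
        if len(self._write_buf) >= self._buffer_size:
            self._flush()

    def dequeue(self):
        """Return the oldest URL, or None if the frontier is empty."""
        if not self._read_buf:
            self._refill()
        if not self._read_buf and self._write_buf:
            # Disk is drained; push pending URLs through it to keep FIFO order.
            self._flush()
            self._refill()
        return self._read_buf.popleft() if self._read_buf else None
```

Disk is touched only once per `buffer_size` operations in the steady state, which is the point of the in-memory buffer.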

## Downloader
Downloads webpages from the internet based on the URLs provided by the frontier. The downloader respects the *robots.txt* file, which specifies which pages a crawler is permitted to download. A snippet of the robots.txt for apple.com:
```
User-agent: *
Disallow: /*shop/browse/overlay/*
Disallow: /*shop/iphone/payments/overlay/*
Disallow: /cn/*/aow/*
Disallow: /tmall*
Allow: /ac/globalnav/2.0/*/images/ac-globalnav/globalnav/search/*
```
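As a rough illustration of checking robots.txt rules with a per-host cache, the sketch below uses Python's standard `urllib.robotparser`. `RobotsCache` and the injected `fetch` callable are hypothetical; in a real crawler `fetch` would perform an HTTP GET of `/robots.txt`. Note the assumption that rules are plain path prefixes: the standard-library parser does simple prefix matching, so wildcard patterns like those in the snippet above may not be interpreted the way major search engines interpret them.

```python
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Parse each host's robots.txt once and reuse the parsed rules."""

    def __init__(self, fetch):
        # fetch: callable taking a host name and returning robots.txt text
        # (hypothetical hook; a real crawler would download the file here).
        self._fetch = fetch
        self._parsers = {}

    def allowed(self, user_agent: str, host: str, path: str) -> bool:
        parser = self._parsers.get(host)
        if parser is None:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser  # cache for later lookups
        return parser.can_fetch(user_agent, path)
```

A real cache would also expire entries after some time, since site owners can change their robots.txt.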

It is good practice to cache the contents of this file. Download performance can be improved by:
- **Distributed Crawl:** crawl jobs are distributed across multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs.
- **DNS Cache:** maintain a local DNS cache, as repeated DNS queries can slow down the process.
- **Geographical Distribution:** a crawler geographically closer to a server can download content faster, so distributing downloaders globally can speed up the process.
- **Short Timeout:** set short timeouts to prevent slow websites from becoming a bottleneck.
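To illustrate the DNS cache item from the list above, here is a minimal sketch of a resolver wrapper with a time-to-live. `DnsCache`, the 300-second default TTL, and the injectable `resolver` and `clock` parameters are assumptions for illustration; `socket.gethostbyname` is the standard-library call used by default.

```python
import socket
import time


class DnsCache:
    """Cache host -> IP lookups for a fixed time-to-live."""

    def __init__(self, resolver=socket.gethostbyname,
                 ttl_seconds: float = 300.0, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host -> (ip, expires_at)

    def resolve(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit, no DNS query issued
        ip = self._resolver(host)  # cache miss or expired entry
        self._entries[host] = (ip, now + self._ttl)
        return ip
```

Making the resolver and clock injectable keeps the cache testable without network access.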

The robustness of the crawling process can be improved by:
- **Consistent Hashing:** use a consistent-hashing-based architecture so that downloaders can be added and removed without redistributing all the URLs to be downloaded. This needs to be complemented with a storage system that saves crawl state.
- **Exception Handling:** downloaders should gracefully handle any exception raised during the process. Some pages, called *spider traps*, can lead to an infinite loop by creating an infinite directory structure, e.g. example.com/foo/bar/foo/bar/foo... One way to detect this is by limiting URL length.
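The consistent hashing idea from the list above can be sketched as a hash ring with virtual nodes. This is one common formulation, not necessarily the one any particular crawler uses; `HashRing`, the downloader names, and the choice of 100 virtual nodes per downloader are illustrative assumptions.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Stable position on the ring for a key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Assign hosts to downloaders; adding or removing one moves few hosts."""

    def __init__(self, nodes=(), replicas: int = 100):
        self._replicas = replicas  # virtual nodes per downloader
        self._ring = []            # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._replicas):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def node_for(self, host: str) -> str:
        if not self._ring:
            raise LookupError("no downloaders registered")
        # First ring point at or after the host's point, wrapping around.
        i = bisect.bisect_left(self._ring, (_point(host),))
        return self._ring[i % len(self._ring)][1]
```

When a downloader is added, the only hosts that change owner are the ones that move to the new downloader; all other assignments stay put.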


## Content Parser
Downloaded webpages need to be parsed and validated for correctness. The parser is often a separate component from the downloader so that parsing does not slow down downloading.

## Content Seen
Since a lot of the web contains duplicate content, it makes sense to check whether the content has already been stored. This can be done by generating a hash of the content and comparing it with the hashes of existing content.
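A minimal sketch of the hash comparison, assuming exact byte-for-byte duplicates and an in-memory set of digests. `ContentSeen` is a hypothetical name, and SHA-256 is one reasonable choice of hash rather than a requirement; near-duplicate detection would need a different technique.

```python
import hashlib


class ContentSeen:
    """Detect exact duplicate pages by comparing content digests."""

    def __init__(self):
        self._digests = set()

    def is_duplicate(self, content: bytes) -> bool:
        """Return True if this exact content was seen before; record it otherwise."""
        digest = hashlib.sha256(content).digest()
        if digest in self._digests:
            return True
        self._digests.add(digest)
        return False
```

Storing 32-byte digests instead of full pages keeps the comparison cheap even when pages are hundreds of kilobytes.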

## Link Extractor and Filter
The link extractor retrieves URLs from webpages. Since many links are relative, they are converted into absolute URLs by resolving them against the URL of the current page (for example, by prepending its scheme and domain). The URL filter excludes certain content types, file extensions, error links, and URLs in blacklisted problematic domains.
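The extraction and filtering steps can be sketched with the standard library's `html.parser` and `urllib.parse.urljoin`. The blocked extensions and the blocked host below are placeholder values, and `extract_links` is a hypothetical helper; a real filter would also look at content types and error links, which this sketch omits.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

BLOCKED_EXTENSIONS = (".jpg", ".png", ".zip")  # placeholder values
BLOCKED_HOSTS = {"spam.example"}               # placeholder blacklist


class _AnchorCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, page_url: str) -> list:
    """Return absolute, filtered http(s) links found in the page."""
    collector = _AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.hostname in BLOCKED_HOSTS:
            continue
        if parts.path.lower().endswith(BLOCKED_EXTENSIONS):
            continue
        links.append(absolute)
    return links
```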

## URL Seen
It is a data structure that keeps track of URLs that have been visited before or are already in the frontier. It helps to avoid adding the same URL multiple times, which can increase server load and cause potential infinite loops.
*Bloom filters* and *hash tables* are common techniques to implement the URL Seen component.
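
For illustration, a small Bloom filter might look like the sketch below. The bit-array size, the number of hash functions, and the double-hashing scheme derived from a single SHA-256 digest are all assumptions chosen for brevity. A Bloom filter can report a URL as seen when it was not (a false positive), but never the reverse, so a crawler using it may occasionally skip a new URL in exchange for a much smaller memory footprint than a hash table.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        # Double hashing: derive k bit positions from two base hashes.
        return [(h1 + i * h2) % self._num_bits for i in range(self._num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage is the same as for a set: call `add(url)` when a URL enters the frontier and test `url in bloom` before enqueueing.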