My attempt at making a distributed web crawler using Golang, Kafka, MongoDB, and AWS S3.
Part of the motivation for this project came from building a new website for my grandfather's work: the old site had content spread across many pages, all of which needed to be transferred to the new one. I thought it would be a good idea to build a web crawler that would crawl the old website and store its content in a database, so the content could then be moved to the new website easily.
At the same time, I wanted to learn more about distributed systems, and Kafka in particular, so I decided to use Kafka to manage the distribution of page crawling. I also wanted to deepen my experience with Go and MongoDB, which made this a perfect project to do so.
The architecture of the system is as follows:

The system is made up of two main services:
- Site fetching service:
  - Responsible for fetching the HTML content of a website (and, in doing so, verifying it is alive) and storing it in an object store, such as AWS S3.
  - Once the page is stored, it publishes a message to a second queue, which is consumed by the processing service.
  - Also responsible for managing retries when a site is unreachable on first fetch: failed URLs are re-queued on the site queue until a retry threshold is reached, then placed on a dead letter queue (DLQ). A sketch of this loop follows the list below.
- Site processing service:
  - Responsible for processing the HTML content of a page and extracting its links and content: the title, meta tags, outbound links, and text body.
  - Stores the extracted content in a database, such as MongoDB, and optionally in a search service, such as Elasticsearch, for lookup. A sketch of this step also follows below.
  - Adds the extracted links back onto the site queue for the fetching service to fetch, if they haven't already been ingested into the MongoDB database.
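As a rough illustration of the fetching service, here is a minimal sketch of its consume-fetch-store loop. It assumes the segmentio/kafka-go client and the AWS SDK for Go (v1); the topic names (`sites`, `processing`, `sites-dlq`), the `crawler-pages` bucket, and the retry threshold of 3 are illustrative, not the project's actual configuration:

```go
package main

import (
	"bytes"
	"context"
	"encoding/base64"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/segmentio/kafka-go"
)

const maxRetries = 3 // assumed threshold before a URL is parked on the DLQ

func main() {
	ctx := context.Background()

	// Consume URLs from the site queue as part of a consumer group, so
	// multiple fetcher instances can split the partitions between them.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "sites",
		GroupID: "site-fetchers",
	})
	processing := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "processing"}
	retry := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "sites"}
	dlq := &kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "sites-dlq"}

	store := s3.New(session.Must(session.NewSession()))
	client := &http.Client{Timeout: 10 * time.Second}

	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			break // reader closed or context cancelled
		}
		url := string(msg.Value)

		body, err := fetch(client, url)
		if err != nil {
			// Unreachable: re-queue with an incremented attempt count,
			// or park the URL on the DLQ once the threshold is hit.
			attempt := attemptFrom(msg) + 1
			target := retry
			if attempt >= maxRetries {
				target = dlq
			}
			target.WriteMessages(ctx, kafka.Message{
				Value:   msg.Value,
				Headers: []kafka.Header{{Key: "attempt", Value: []byte(strconv.Itoa(attempt))}},
			})
			continue
		}

		// Key the stored object by an encoding of the URL so re-fetches
		// overwrite the same object.
		key := base64.RawURLEncoding.EncodeToString([]byte(url)) + ".html"
		if _, err := store.PutObject(&s3.PutObjectInput{
			Bucket: aws.String("crawler-pages"),
			Key:    aws.String(key),
			Body:   bytes.NewReader(body),
		}); err != nil {
			continue // leave the message for redelivery
		}

		// Hand the stored page off to the processing service.
		processing.WriteMessages(ctx, kafka.Message{Key: []byte(url), Value: []byte(key)})
	}
}

// fetch downloads a page, treating any non-200 response as a failure.
func fetch(c *http.Client, url string) ([]byte, error) {
	resp, err := c.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s: status %d", url, resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// attemptFrom reads the retry counter carried in the message headers.
func attemptFrom(m kafka.Message) int {
	for _, h := range m.Headers {
		if h.Key == "attempt" {
			n, _ := strconv.Atoi(string(h.Value))
			return n
		}
	}
	return 0
}
```

Carrying the attempt count in a Kafka message header keeps the retry state inside the queue itself, so the DLQ threshold needs no extra bookkeeping store.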
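And a minimal sketch of the processing side, assuming goquery for HTML parsing and the official MongoDB Go driver; the `Page` shape, the collection name, and the dedupe-by-lookup approach are illustrative assumptions rather than the project's actual code:

```go
package processor

import (
	"context"
	"io"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/segmentio/kafka-go"
	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// Page is an assumed document shape for the extracted content.
type Page struct {
	URL   string            `bson:"url"`
	Title string            `bson:"title"`
	Meta  map[string]string `bson:"meta"`
	Links []string          `bson:"links"`
	Body  string            `bson:"body"`
}

// extract pulls the title, meta tags, outbound links, and text body
// out of a page's HTML.
func extract(url string, html io.Reader) (*Page, error) {
	doc, err := goquery.NewDocumentFromReader(html)
	if err != nil {
		return nil, err
	}
	page := &Page{URL: url, Meta: map[string]string{}}
	page.Title = strings.TrimSpace(doc.Find("title").Text())

	// Meta tags: keep name/content pairs.
	doc.Find("meta[name]").Each(func(_ int, s *goquery.Selection) {
		name, _ := s.Attr("name")
		content, _ := s.Attr("content")
		page.Meta[name] = content
	})

	// Outbound links.
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			page.Links = append(page.Links, href)
		}
	})

	page.Body = strings.TrimSpace(doc.Find("body").Text())
	return page, nil
}

// store upserts the page into MongoDB and queues any links that MongoDB
// hasn't seen yet back onto the site queue for the fetching service.
func store(ctx context.Context, pages *mongo.Collection, sites *kafka.Writer, p *Page) error {
	_, err := pages.UpdateOne(ctx,
		bson.M{"url": p.URL},
		bson.M{"$set": p},
		options.Update().SetUpsert(true))
	if err != nil {
		return err
	}
	for _, link := range p.Links {
		n, err := pages.CountDocuments(ctx, bson.M{"url": link})
		if err == nil && n == 0 {
			sites.WriteMessages(ctx, kafka.Message{Value: []byte(link)})
		}
	}
	return nil
}
```

The service's consume loop would call `extract` on each page pulled from the processing queue and then `store` the result. Checking MongoDB for an existing document before re-queueing a link is the simplest way to avoid crawl loops; an index on `url` keeps that existence check cheap.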