Crawler

Crawler is a simple web crawler implemented as a CLI tool and written in Go.

It attempts to crawl a given domain and extract all the links found on that domain. It then follows those links and extracts their links in turn, until every reachable page on the domain has been retrieved.

Crawler will only parse and follow links associated with the given domain.

For example, if the domain is https://example.com, it will only parse links that are part of the example.com website. It will ignore external links such as https://google.com, and even subdomains like https://app.example.com.
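The README does not show the filtering code itself, but the same-domain rule described above can be illustrated with a minimal sketch using the standard library's net/url package. The sameDomain helper below is a hypothetical example, not the project's actual implementation.

package main

import (
	"fmt"
	"net/url"
)

// sameDomain reports whether link has exactly the same host as base.
// Subdomains such as app.example.com are treated as external, matching
// the behaviour described above.
func sameDomain(base, link string) bool {
	b, errB := url.Parse(base)
	l, errL := url.Parse(link)
	if errB != nil || errL != nil {
		return false
	}
	return b.Hostname() == l.Hostname()
}

func main() {
	fmt.Println(sameDomain("https://example.com", "https://example.com/about")) // true
	fmt.Println(sameDomain("https://example.com", "https://app.example.com/"))  // false
	fmt.Println(sameDomain("https://example.com", "https://google.com"))        // false
}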

Requirements

In order to use Crawler, a valid domain must be provided. This can be any valid website, such as https://google.com or https://www.wikipedia.org/.

Crawler was built using Go 1.22, so in order to run or build the application, users must have the Go toolchain installed.

Instructions for installing Go can be found at https://go.dev/doc/install.

Using the CLI

Once you've chosen a domain to crawl and installed Go, Crawler can be used by running the program from the root of the repository.

Crawler takes two arguments:

  • --domain - mandatory; the domain you wish Crawler to crawl. Format: https://example.com
  • --concurrency - the number of concurrent workers you wish Crawler to spawn. More workers make it run faster but consume more resources. This argument is a plain integer. Default is 5. A sketch of the kind of worker pool this flag suggests follows the examples below.

Basic example:

go run cmd/crawler/main.go --domain "<CHOSEN_DOMAIN>"

Example with concurrency:

go run cmd/crawler/main.go --domain "<CHOSEN_DOMAIN>" --concurrency 10
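The repository's internals are not reproduced in this README, but the --concurrency flag suggests a bounded worker pool. The sketch below is a hypothetical illustration of that idea, assuming a fetchLinks function that stands in for the HTTP request and link-extraction step; it is not the project's actual code.

package main

import (
	"fmt"
	"sync"
)

// crawl starts at seed and visits every discovered link exactly once,
// allowing at most `workers` fetches to run concurrently.
// fetchLinks is a hypothetical stand-in for fetching a page and
// extracting its same-domain links.
func crawl(seed string, workers int, fetchLinks func(string) []string) {
	var (
		mu      sync.Mutex
		visited = map[string]bool{seed: true}
		wg      sync.WaitGroup
		sem     = make(chan struct{}, workers) // bounds concurrent fetches
	)

	var visit func(u string)
	visit = func(u string) {
		defer wg.Done()
		sem <- struct{}{}      // acquire a worker slot
		links := fetchLinks(u) // fetch the page and extract its links
		<-sem                  // release the slot
		fmt.Printf("Page %s has links: %v\n", u, links)

		mu.Lock()
		defer mu.Unlock()
		for _, l := range links {
			if !visited[l] {
				visited[l] = true
				wg.Add(1)
				go visit(l)
			}
		}
	}

	wg.Add(1)
	go visit(seed)
	wg.Wait()
}

func main() {
	// Dummy in-memory "site" for illustration; the real tool performs HTTP requests.
	pages := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com"},
		"https://example.com/b": {},
	}
	crawl("https://example.com", 10, func(u string) []string { return pages[u] })
}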

Output:

Page https://go.dev/ has links: [https://go.dev/solutions/case-studies https://go.dev/solutions/use-cases https://go.dev/security/ https://go.dev/learn/ https://go.dev/doc/effective_go https://go.dev/doc https://go.dev/doc/devel/release https://go.dev/talks/ https://go.dev/wiki/Conferences https://go.dev/blog https://go.dev/help https://go.dev/dl https://go.dev/dl/ https://go.dev/solutions/ https://go.dev/solutions/google/ https://go.dev/solutions/paypal https://go.dev/solutions/americanexpress https://go.dev/solutions/mercadolibre https://go.dev/tour/ https://go.dev/solutions/cloud/ https://go.dev/solutions/clis/ https://go.dev/pkg/net/http/ https://go.dev/pkg/html/template/ https://go.dev/pkg/database/sql/ https://go.dev/solutions/webdev/ https://go.dev/solutions/devops/ https://go.dev/doc/install/ https://go.dev/learn https://go.dev/play https://go.dev/help/ https://go.dev/pkg/ https://go.dev/project https://go.dev/blog/ https://go.dev/brand https://go.dev/conduct https://go.dev/copyright https://go.dev/tos https://go.dev/s/website-issue]

Page https://go.dev/tos has links: [https://go.dev/solutions/case-studies https://go.dev/solutions/use-cases https://go.dev/security/ https://go.dev/learn/ https://go.dev/doc/effective_go https://go.dev/doc https://go.dev/doc/devel/release https://go.dev/talks/ https://go.dev/wiki/Conferences https://go.dev/blog https://go.dev/help https://go.dev/solutions/ https://go.dev/play https://go.dev/tour/ https://go.dev/help/ https://go.dev/pkg/ https://go.dev/project https://go.dev/dl/ https://go.dev/blog/ https://go.dev/brand https://go.dev/conduct https://go.dev/copyright https://go.dev/tos https://go.dev/s/website-issue]
