This web crawler crawls a given website and generates a report of all the internal and external links found during the crawl.
- Code Flow: https://codeflow.dananglin.me.uk/apollo/web-crawler
- GitHub: https://github.com/dananglin/web-crawler
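
Whether a link counts as internal or external comes down to comparing its host with the host of the start URL, after resolving relative links against the current page. The snippet below is a minimal sketch of that classification using Go's standard net/url package; it is illustrative only and is not taken from the crawler's source.

```go
package main

import (
	"fmt"
	"net/url"
)

// isInternal reports whether link points to the same host as base.
// Relative links resolve against base, so they count as internal.
func isInternal(base *url.URL, link string) (bool, error) {
	u, err := url.Parse(link)
	if err != nil {
		return false, err
	}
	resolved := base.ResolveReference(u)
	return resolved.Host == base.Host, nil
}

func main() {
	base, _ := url.Parse("https://crawler-test.com")
	for _, link := range []string{"/about", "https://example.org/page"} {
		internal, err := isInternal(base, link)
		if err != nil {
			continue
		}
		fmt.Printf("%s internal=%v\n", link, internal)
	}
}
```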
- Go: Go 1.23.0 or later is required to build or install the web crawler. You can download the latest version from https://go.dev/dl/.
Clone this repository to your local machine.

```
git clone https://github.com/dananglin/web-crawler.git
```
Build the application.
- Build with go:

  ```
  go build -o crawler .
  ```

- Or build with mage if you have it installed:

  ```
  mage build
  ```
Run the application, specifying the website that you want to crawl.

```
./crawler [FLAGS] URL
```
- Crawl the [Crawler Test Site](https://crawler-test.com).

  ```
  ./crawler https://crawler-test.com
  ```

- Crawl the site using 3 concurrent workers and stop the crawl after discovering a maximum of 100 unique pages.

  ```
  ./crawler --max-workers 3 --max-pages 100 https://crawler-test.com
  ```

- Crawl the site and print out a JSON report.

  ```
  ./crawler --max-workers 3 --max-pages 100 --format json https://crawler-test.com
  ```

- Crawl the site and save the report to a CSV file.

  ```
  mkdir -p reports
  ./crawler --max-workers 3 --max-pages 100 --format csv --file reports/report.csv https://crawler-test.com
  ```
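The --max-workers and --max-pages flags bound the crawl: the first caps how many pages are fetched concurrently, the second caps how many unique pages are discovered in total. The sketch below illustrates one common way to implement those two bounds in Go, using a buffered channel as a semaphore and a shared visited set. It shows the pattern only, not the crawler's actual code, and fetchLinks is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl visits pages starting from seed, with at most maxWorkers
// concurrent fetches and at most maxPages unique pages overall.
func crawl(seed string, maxWorkers, maxPages int) {
	var (
		mu      sync.Mutex
		visited = map[string]bool{}
		wg      sync.WaitGroup
		sem     = make(chan struct{}, maxWorkers) // limits concurrency
	)

	var visit func(page string)
	visit = func(page string) {
		defer wg.Done()
		sem <- struct{}{}        // acquire a worker slot
		defer func() { <-sem }() // release it when done

		for _, link := range fetchLinks(page) {
			mu.Lock()
			if visited[link] || len(visited) >= maxPages {
				mu.Unlock()
				continue
			}
			visited[link] = true
			mu.Unlock()
			wg.Add(1)
			go visit(link)
		}
	}

	visited[seed] = true
	wg.Add(1)
	go visit(seed)
	wg.Wait()
	fmt.Printf("visited %d pages\n", len(visited))
}

// fetchLinks is a hypothetical stand-in; a real crawler would perform
// an HTTP GET and parse the HTML for anchor tags.
func fetchLinks(page string) []string { return nil }

func main() {
	crawl("https://crawler-test.com", 3, 100)
}
```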
You can configure the application with the following flags.
| Name | Description | Default |
|---|---|---|
| max-workers | The maximum number of concurrent workers. | 2 |
| max-pages | The maximum number of pages the crawler can discover before stopping the crawl. | 10 |
| format | The format of the generated report. Currently supports text, csv or json. | text |
| file | The file to save the generated report to. Leave this empty to print to the screen instead. | |
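
As a rough illustration of how a CLI like this might wire up the flags in the table, here is a minimal sketch using Go's standard flag package (which accepts both -name and --name forms). Whether the crawler actually parses its flags this way is an assumption; the names and defaults simply mirror the table above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Flag names and defaults mirror the table above; that the crawler
	// uses the standard flag package is an assumption.
	maxWorkers := flag.Int("max-workers", 2, "maximum number of concurrent workers")
	maxPages := flag.Int("max-pages", 10, "maximum number of pages to discover")
	format := flag.String("format", "text", "report format: text, csv or json")
	file := flag.String("file", "", "file to save the report to (empty prints to the screen)")
	flag.Parse()

	// Exactly one positional argument is expected: the URL to crawl.
	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: crawler [FLAGS] URL")
		os.Exit(1)
	}
	url := flag.Arg(0)

	fmt.Println(url, *maxWorkers, *maxPages, *format, *file)
}
```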