A concurrent web crawler written in Go that builds a directed graph of website link structures. Each crawled page becomes a vertex, and every hyperlink becomes an edge, producing a full map of how pages connect to each other.
Crawph models a website as a directed graph:
- Vertices represent discovered URLs, indexed by full URL and by domain
- Edges represent hyperlinks between pages
- The graph is thread-safe — concurrent workers add vertices and edges without races
- URL normalization ensures each page appears as a single vertex, avoiding duplicates
The output graph can be serialized to JSON (for inspection and tooling) or binary (gob, for compact storage and fast reload).
Features:

- Directed graph construction with concurrent-safe vertex/edge insertion
- Dual indexing (full URL + domain) for fast lookups
- Pipeline architecture (fetch → extract → store)
- robots.txt compliance with Crawl-delay support
- Per-domain rate limiting
- URL normalization and deduplication
- Configurable crawl depth and worker pool
- JSON and binary graph serialization
- YAML configuration file support
Install with:

```shell
go install github.com/alexmar07/crawler-go/cmd@latest
```

Or build from source:

```shell
task
# Binary is at bin/crawph
```

Run with seed URLs or a config file:

```shell
crawph -urls https://example.com
crawph -config crawph.yml
```

| Flag | Description | Default |
|---|---|---|
| -urls | Comma-separated seed URLs | |
| -config | Path to YAML config file | |
| -workers | Number of concurrent workers | 5 |
| -depth | Maximum crawl depth | 10 |
| -output | Output file path | data/result |
| -format | Output format (json\|binary) | json |
| -timeout | HTTP request timeout | 30s |
CLI flags override config file values.
```yaml
seeds:
  - https://example.com

crawl:
  max_depth: 10
  max_workers: 5
  timeout: 30s
  user_agent: "Crawph/1.0"

rate_limit:
  default_rps: 1.0
  respect_crawl_delay: true

robots:
  enabled: true

storage:
  format: json
  output: data/result
```

```shell
# Run tests
task test

# Build
task

# Clean
task clean
```

MIT