Skip to content

Willyham/gospider

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gospider

gospider is a concurrent web spider. By default, it respects robots.txt entries.

Usage

make install

gospider is designed to be used either from the CLI or in code.

CLI

gospider can be run from the CLI by running

gospider start -r "http://foo.bar/" > out.html

By default, gospider writes an HTML sitemap to stdout.

Use gospider --help for more options.

Code

the spider.New function follows the functional options pattern. The only parameter which is required is the root URL - all others will be defaulted to sensible values if not supplied.

uri, _ := url.Parse("http://foo.bar/")

spider := spider.New(
  spider.WithRoot(uri),
  spider.WithConcurrency(5),
  spider.WithTimeout(time.Second * 2),
)

err = spider.Run()
if err != nil {
  log.Fatal("error running spider: ", err)
}
return spider.Report(os.Stdout)

Modularity

gospider ships with a simple HTML reporter and uses the default HTTP client to make requests. However, any requester or reporter can be used by supplying a struct which implements the Requester or Reporter interface. For example, to make requests through a proxy you could do:

type proxyRequester struct {
  client *http.Client
}

func (r *proxyRequester) Request(ctx context.Context, uri *url.URL) ([]byte, error) {
  res, err := r.client.Get(uri.String())
  // handle err, read body, etc.
  return body, nil
}

s := spider.New(
  WithRoot(...),
  WithRequester(&proxyRequester{
    client: &http.Client{
      Transport: &http.Transport{Proxy: http.ProxyURL(...)}
    }
  })
)

Concurrency

gospider uses a worker pool concurrency model. As URLs are found they are added to a queue. Each worker (controlled with the concurrency parameter) will poll the queue for work. Once the queue is empty, the worker pool is drained and the spider will stop.

About

Concurrent web spider in golang

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published