Crawl the web using nats.io and Go

OwlCrawler

OwlCrawler is a distributed web crawler written in Go that uses nats.io to coordinate work between its workers.
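To give a feel for how two workers can hand work to each other through a message layer, here is a minimal in-process sketch. It is an analogy only: Go channels stand in for NATS subjects, the "web" is a hypothetical in-memory map, and the function names (`crawl`, `extractLinks`) and the message flow are assumptions, not taken from the OwlCrawler source.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRe is a deliberately naive link matcher, good enough for the sketch.
var hrefRe = regexp.MustCompile(`href="([^"]+)"`)

// extractLinks returns every href value found in a page body.
func extractLinks(body string) []string {
	var links []string
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		links = append(links, m[1])
	}
	return links
}

// crawl visits every page reachable from seed in the fake web.
// Channels play the role a message broker would play between processes.
func crawl(seed string, web map[string]string) []string {
	// Buffers are sized generously for a tiny fake web so no send blocks.
	urls := make(chan string, 16)    // work for the fetcher
	pages := make(chan string, 16)   // work for the extractor
	links := make(chan []string, 16) // results back to the coordinator

	// Fetcher worker: turns a URL into a page body.
	go func() {
		for u := range urls {
			pages <- web[u]
		}
		close(pages)
	}()

	// Extractor worker: turns a page body into a list of links.
	go func() {
		for p := range pages {
			links <- extractLinks(p)
		}
	}()

	seen := map[string]bool{seed: true}
	order := []string{seed}
	urls <- seed
	pending := 1 // pages sent to the fetcher but not yet extracted
	for pending > 0 {
		found := <-links
		pending--
		for _, l := range found {
			if !seen[l] {
				seen[l] = true
				order = append(order, l)
				urls <- l
				pending++
			}
		}
	}
	close(urls)
	return order
}

func main() {
	web := map[string]string{
		"a": `<a href="b">b</a> <a href="c">c</a>`,
		"b": `<a href="c">c</a>`,
		"c": `<a href="a">a</a>`,
	}
	fmt.Println(crawl("a", web)) // prints [a b c]
}
```

In a distributed setup the two goroutines would be separate processes and the channels would be replaced by the message broker, so workers can be added or restarted independently.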

Dependencies

  • CouchDB 1.x (tested on 1.6.1)
  • gnatsd

Building

Build the two workers, the fetcher and the extractor:

go build -tags=fetcherExec -o fetcher fetcher.go && \
go build -tags=extractorExec -o extractor extractor.go

Setup

  1. Set up CouchDB with at least one admin user; you can follow the instructions here
  2. Create a file named .couchdb.json and place it in your $HOME directory

Sample .couchdb.json

```
{
  "user": "user-here",
  "password": "super-secret-password",
  "url": "http://localhost:5984/owl-crawler"
}
```
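Before starting the workers, it can help to confirm that the file parses and has all three keys. The following is a minimal sketch of such a check, assuming the file lives at `$HOME/.couchdb.json` as described above; the `couchConfig` type and `parseCouchConfig` function are hypothetical names for illustration and are not taken from the OwlCrawler source.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// couchConfig mirrors the keys in the sample .couchdb.json.
// Hypothetical type, not part of OwlCrawler itself.
type couchConfig struct {
	User     string `json:"user"`
	Password string `json:"password"`
	URL      string `json:"url"`
}

// parseCouchConfig decodes the JSON and rejects a file with missing keys.
func parseCouchConfig(data []byte) (couchConfig, error) {
	var cfg couchConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("invalid JSON: %w", err)
	}
	if cfg.User == "" || cfg.Password == "" || cfg.URL == "" {
		return cfg, errors.New("user, password and url must all be set")
	}
	return cfg, nil
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot find home directory:", err)
		return
	}
	data, err := os.ReadFile(filepath.Join(home, ".couchdb.json"))
	if err != nil {
		fmt.Println("cannot read config:", err)
		return
	}
	cfg, err := parseCouchConfig(data)
	if err != nil {
		fmt.Println("bad config:", err)
		return
	}
	// The password is deliberately not printed.
	fmt.Printf("CouchDB user %q, database URL %s\n", cfg.User, cfg.URL)
}
```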
  3. Create a file named .gnatsd.json and place it in your $HOME directory

Sample .gnatsd.json

```
{
  "URL": "nats://owlcrawler:natsd_password@127.0.0.1:4222"
}
```
  4. Start gnatsd with a user and password. A config file is preferable, but for a quick test you can pass them as parameters:
~/gnatsd --user owlcrawler --pass natsd_password
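The credentials embedded in the `URL` value of .gnatsd.json have to match the `--user` and `--pass` values given to gnatsd. As a rough sanity check, the sketch below parses the sample config with the standard library and pulls out the user, password, and host. The `natsConfig` type and `natsCredentials` function are hypothetical names for illustration, and the sketch does not connect to the server.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
)

// natsConfig mirrors the single "URL" key in the sample .gnatsd.json.
// Hypothetical type, not part of OwlCrawler itself.
type natsConfig struct {
	URL string `json:"URL"`
}

// natsCredentials extracts the user, password and host:port from the config.
func natsCredentials(data []byte) (user, pass, host string, err error) {
	var cfg natsConfig
	if err = json.Unmarshal(data, &cfg); err != nil {
		return "", "", "", err
	}
	u, err := url.Parse(cfg.URL)
	if err != nil {
		return "", "", "", err
	}
	if u.Scheme != "nats" || u.User == nil {
		return "", "", "", errors.New("expected nats://user:password@host:port")
	}
	pass, _ = u.User.Password()
	return u.User.Username(), pass, u.Host, nil
}

func main() {
	sample := []byte(`{"URL": "nats://owlcrawler:natsd_password@127.0.0.1:4222"}`)
	user, pass, host, err := natsCredentials(sample)
	if err != nil {
		fmt.Println("bad config:", err)
		return
	}
	// Compare these with the --user and --pass values passed to gnatsd.
	fmt.Printf("user=%s password-set=%t host=%s\n", user, pass != "", host)
}
```

With the sample values, the user is `owlcrawler` and the host is `127.0.0.1:4222`, which lines up with the gnatsd command shown above.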

In terminal 1, start the extractor:

./extractor -logtostderr=true -v=3

In terminal 2, start the fetcher:

./fetcher -logtostderr=true -v=3

In terminal 3, build and start the webapp:

cd webapp
go build && ./webapp -alsologtostderr=true

In terminal 4, run Grunt from the webapp directory:

cd webapp
grunt serve