distributed-web-crawler

Course project for the Introduction to Golang course on imooc.

Tech stack

Golang 1.11
Elasticsearch 6.5.4
Docker

Single-node version

/crawler

Distributed version

/crawler-distributed

Simple front-end page

/front-end

Run

Single-node version:

docker run -d -p 9200:9200 elasticsearch:x.x.x (replace x.x.x with your Elasticsearch version, e.g. 6.5.4)

(from the project root directory) cd crawler

go run main.go

Distributed version:

docker run -d -p 9200:9200 elasticsearch:x.x.x (replace x.x.x with your Elasticsearch version, e.g. 6.5.4)

(from crawler-distributed) cd persist

go run itemSaver.go

(from crawler-distributed) cd worker/server

go run worker.go (start as many worker servers as you like, as long as each one has its own port and the ports are set in config.go)
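Running several workers means each one needs its own address, and the engine needs the full list. The sketch below shows one way such a list could be parsed. It is only an illustration: the `worker_hosts` flag and the `parseWorkerHosts` helper are hypothetical names, and the project itself keeps its ports in config.go.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseWorkerHosts splits a comma-separated list such as ":9000,:9001"
// into individual addresses, trimming spaces and skipping empty entries.
// Hypothetical helper, not part of the repository.
func parseWorkerHosts(s string) []string {
	var hosts []string
	for _, h := range strings.Split(s, ",") {
		if h = strings.TrimSpace(h); h != "" {
			hosts = append(hosts, h)
		}
	}
	return hosts
}

func main() {
	// Hypothetical flag; the repository reads its ports from config.go instead.
	workers := flag.String("worker_hosts", ":9000", "comma-separated worker addresses")
	flag.Parse()

	for i, h := range parseWorkerHosts(*workers) {
		fmt.Printf("worker %d at %s\n", i, h)
	}
}
```

With a helper like this, adding a worker would only require extending the address list rather than editing code.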

(from the project root directory) cd crawler-distributed

go run main.go

Simple front-end page:

(from the project root directory) cd front-end

go run start.go

Todo List

Crawl more websites, using CSS selectors or XPath instead of regular expressions.
Handle anti-crawling mechanisms (QPS limits, encrypted cookies), or follow the robots.txt protocol.
Add a login mechanism.
Move de-duplication into a separate module (with Redis).
Improve Elasticsearch search quality.
Build a handier front-end page.
Use the crawled data to experiment with AI.
Package and deploy with Docker and Kubernetes.
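
As a rough illustration of the de-duplication item above, the sketch below hides the "have we seen this URL?" check behind a small interface, so an in-memory set could later be swapped for a Redis-backed one. The `Deduper` interface and `memoryDeduper` type are hypothetical names, not code from this repository.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper reports whether a URL has been seen before.
// Hypothetical interface; a Redis-backed type could implement it too.
type Deduper interface {
	Seen(url string) bool
}

// memoryDeduper is an in-memory implementation that is safe for concurrent use.
type memoryDeduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemoryDeduper() *memoryDeduper {
	return &memoryDeduper{seen: make(map[string]struct{})}
}

// Seen returns true if url was recorded earlier; otherwise it records
// the url and returns false.
func (d *memoryDeduper) Seen(url string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[url]; ok {
		return true
	}
	d.seen[url] = struct{}{}
	return false
}

func main() {
	var d Deduper = newMemoryDeduper()
	urls := []string{"http://example.com/a", "http://example.com/b", "http://example.com/a"}
	for _, u := range urls {
		if d.Seen(u) {
			fmt.Println("skip duplicate:", u)
			continue
		}
		fmt.Println("crawl:", u)
	}
}
```

Keeping the check behind an interface means the scheduler would not need to change when the storage moves out of process.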
