Getting Started

zhengchun edited this page Dec 29, 2017 · 4 revisions

1. Install

go get github.com/antchfx/antch

2. Defining our Item

type item struct {
    Title string `json:"title"`
    Link  string `json:"link"`
    Desc  string `json:"desc"`
}

3. Our first Spider

Create a struct called dmozSpider that implements the antch.Handler interface.

type dmozSpider struct {}

func (s *dmozSpider) ServeSpider(c chan<- antch.Item, res *http.Response) {}

dmozSpider extracts data from each received page and passes it into the Pipeline.

doc, err := antch.ParseHTML(res)
if err != nil {
    return
}
for _, node := range htmlquery.Find(doc, "//div[@id='site-list-content']/div") {
    v := new(item)
    v.Title = htmlquery.InnerText(htmlquery.FindOne(node, "//div[@class='site-title']"))
    v.Link = htmlquery.SelectAttr(htmlquery.FindOne(node, "//a"), "href")
    v.Desc = htmlquery.InnerText(htmlquery.FindOne(node, "//div[contains(@class,'site-descr')]"))
    c <- v
}

The htmlquery package supports extracting data with XPath expressions; each extracted item is then sent to the Go channel c.

c <- v

4. Our first Pipeline

Create a new Pipeline called jsonOutputPipeline that implements the PipelineHandler interface.

jsonOutputPipeline serializes each received Item as JSON and prints it to the console.

type jsonOutputPipeline struct {}

func (p *jsonOutputPipeline) ServePipeline(v antch.Item) {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	os.Stdout.Write(b)
	os.Stdout.Write([]byte{'\n'})
}
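
Step 6 below also registers a newTrimSpacePipeline(). As a rough, self-contained sketch of what such a pipeline might do (the type names and chaining style here are assumptions, not antch's actual API), it strips leading and trailing whitespace from the extracted fields before passing the item on:

```go
package main

import "strings"

// item matches the struct from step 2.
type item struct {
	Title string `json:"title"`
	Link  string `json:"link"`
	Desc  string `json:"desc"`
}

// trimSpacePipeline is a hypothetical pipeline stage: it cleans up
// the string fields of an item, then hands the item to the next stage
// (if any). antch's real pipeline chaining may differ.
type trimSpacePipeline struct {
	next func(*item) // next pipeline stage, nil if this is the last one
}

func (p *trimSpacePipeline) ServePipeline(v *item) {
	v.Title = strings.TrimSpace(v.Title)
	v.Link = strings.TrimSpace(v.Link)
	v.Desc = strings.TrimSpace(v.Desc)
	if p.next != nil {
		p.next(v)
	}
}
```

Scraped HTML frequently carries stray whitespace around text nodes, which is why a cleanup stage like this typically runs before the JSON output stage.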

5. Crawler

Create a new web crawler instance.

crawler := antch.NewCrawler()

You can enable middleware for HTTP cookies or robots.txt handling if you want.

  • Enable the cookies middleware for the web crawler.
crawler.UseCookies()
  • You can even register custom middleware for the web crawler.
crawler.UseMiddleware(CustomMiddleware())

6. Register Spider and Pipeline

Register dmozSpider to the web crawler instance.

dmozSpider will process every page whose URL matches the dmoztools.net pattern.

crawler.Handle("dmoztools.net", &dmozSpider{})

Register the pipelines to the web crawler instance (the example project also registers a trim-space pipeline that strips whitespace from extracted fields).

crawler.UsePipeline(newTrimSpacePipeline(), newJsonOutputPipeline())

7. Running

startURLs := []string{
    "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
    "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/",
}
crawler.StartURLs(startURLs)

go run main.go

Enjoy it.

Source Code

https://github.com/antchfx/antch-getstarted