Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.


Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.


Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8

See scraper Options for all custom settings.
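To make the "Request Delays (Constant/Randomized)" feature concrete, here is a minimal standard-library sketch of how a constant or randomized delay could be computed. The names nextDelay and newRNG are hypothetical, and the 0.5x to 1.5x range is an assumption borrowed from common crawler practice, not something this README specifies.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRNG returns a seeded random source (hypothetical helper for this sketch).
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// nextDelay returns how long to wait before the next request.
// With randomize == false the delay is constant. With randomize == true it is
// drawn uniformly from [0.5*base, 1.5*base); that range is an assumption.
func nextDelay(base time.Duration, randomize bool, rng *rand.Rand) time.Duration {
	if base <= 0 {
		return 0
	}
	if !randomize {
		return base
	}
	return base/2 + time.Duration(rng.Int63n(int64(base)))
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	fmt.Println("constant:  ", nextDelay(time.Second, false, rng))
	fmt.Println("randomized:", nextDelay(time.Second, true, rng))
}
```

Randomizing around the base value keeps the average request rate the same while avoiding a perfectly regular request pattern.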

Status

We highly recommend using Geziyor with Go modules.

Usage

This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.

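package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"http://quotes.toscrape.com/"},
        ParseFunc: quotesParse,
        Exporters: []export.Exporter{&export.JSON{}},
    }).Start()
}

// quotesParse exports every quote on the page, then follows the "next" link.
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
    // HTMLDoc is nil when the response is not HTML.
    if r.HTMLDoc == nil {
        return
    }
    r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
        g.Exports <- map[string]interface{}{
            "text":   s.Find("span.text").Text(),
            "author": s.Find("small.author").Text(),
        }
    })
    // Queue the next page, if any, with the same parse function.
    if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
        g.Get(r.JoinURL(href), quotesParse)
    }
}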

See tests for more usage examples.
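The pagination step in the example relies on r.JoinURL(href) turning the relative "next" link into an absolute URL. How Geziyor implements this is not shown here; the sketch below reproduces the idea with the standard net/url package. The function name joinURL is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// joinURL resolves href against the URL of the page it was found on.
// This is a stand-in for the idea behind r.JoinURL, not Geziyor's code.
func joinURL(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := joinURL("http://quotes.toscrape.com/", "/page/2/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```

Resolving against the current page URL means the same parse function works on every page, whether the site emits relative or absolute links.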

Documentation

Installation

go get -u github.com/geziyor/geziyor

If you want to make JS rendered requests, make sure you have Chrome installed.

NOTE: macOS limits the maximum number of open file descriptors. To make more than 256 concurrent requests, you need to raise that limit. Read this for more.
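To see what the limit currently is for your process, a small Unix-only sketch like the following can help. It uses the standard syscall package and is independent of Geziyor; the function name openFileLimit is hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard limits on open file descriptors
// for the current process (Unix-like systems only).
func openFileLimit() (uint64, uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("could not read limit:", err)
		return
	}
	fmt.Printf("open file descriptors: soft=%d hard=%d\n", soft, hard)
}
```

If the soft limit is close to the number of concurrent requests you plan to make, raise it before starting the crawl.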

Making Normal Requests

Initial requests are taken from the StartURLs []string field in Options. Geziyor makes concurrent requests to those URLs. After a response is read, ParseFunc func(g *Geziyor, r *Response) is called.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://api.ipify.org"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()
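The flow described above (request the start URLs concurrently, then hand each response to a parse callback) can be sketched with only the standard library. This is an illustration of the pattern under stated assumptions, not Geziyor's implementation: fetchAll, parseFunc and newTestServer are hypothetical names, and a local test server stands in for real sites.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// parseFunc is a hypothetical stand-in for a ParseFunc-style callback.
type parseFunc func(url string, body []byte)

// fetchAll requests every URL concurrently and calls parse once for each
// response whose body was read successfully.
func fetchAll(urls []string, parse parseFunc) {
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				return // a real crawler would report the error
			}
			defer resp.Body.Close()
			body, err := io.ReadAll(resp.Body)
			if err != nil {
				return
			}
			parse(u, body)
		}(u)
	}
	wg.Wait()
}

// newTestServer starts a local server so the sketch runs without internet access.
func newTestServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from "+r.URL.Path)
	}))
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	bodies := make(chan string, 2)
	fetchAll([]string{srv.URL + "/a", srv.URL + "/b"}, func(u string, body []byte) {
		bodies <- string(body) // callbacks run concurrently, so use a channel
	})
	close(bodies)
	for b := range bodies {
		fmt.Println(b)
	}
}
```

Because the callback runs on several goroutines at once, anything it shares (a slice, a counter, a file) needs a channel or a lock.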

If you want to create the first requests manually, set StartRequestsFunc. StartURLs won't be used if you create requests manually.
You can make requests using Geziyor methods:

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.Get("https://httpbin.org/anything", g.Opt.ParseFunc)
        g.Head("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
}).Start()

Making JS Rendered Requests

JS rendered requests can be made using the GetRendered method. By default, Geziyor uses the local Chrome application's CLI to start a Chrome browser. Set the BrowserEndpoint option to use a different Chrome instance, such as "ws://localhost:3000".

geziyor.NewGeziyor(&geziyor.Options{
    StartRequestsFunc: func(g *geziyor.Geziyor) {
        g.GetRendered("https://httpbin.org/anything", g.Opt.ParseFunc)
    },
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        fmt.Println(string(r.Body))
    },
    //BrowserEndpoint: "ws://localhost:3000",
}).Start()

Extracting Data

We can extract HTML elements using response.HTMLDoc. HTMLDoc is Goquery's Document.

HTMLDoc is available on the Response if the response is HTML and can be parsed using Go's built-in HTML parser. If the response isn't HTML, response.HTMLDoc will be nil.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            log.Println(s.Find("span.text").Text(), s.Find("small.author").Text())
        })
    },
}).Start()
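Since HTMLDoc can be nil, a ParseFunc that may receive non-HTML responses should check it before calling Find. How Geziyor decides whether a response is HTML is not described in this README; one plausible check, sketched here with the standard mime package, looks at the Content-Type header. The function isHTML is hypothetical.

```go
package main

import (
	"fmt"
	"mime"
)

// isHTML reports whether a Content-Type header value describes an HTML document.
// This is an assumed heuristic, not Geziyor's actual rule.
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false // missing or malformed header: treat as non-HTML
	}
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

func main() {
	for _, ct := range []string{"text/html; charset=utf-8", "application/json", ""} {
		fmt.Printf("%q -> %v\n", ct, isHTML(ct))
	}
}
```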

Exporting Data

You can export data automatically using exporters. Just send the data to the Geziyor.Exports channel. See the available exporters.

geziyor.NewGeziyor(&geziyor.Options{
    StartURLs: []string{"http://quotes.toscrape.com/"},
    ParseFunc: func(g *geziyor.Geziyor, r *client.Response) {
        r.HTMLDoc.Find("div.quote").Each(func(_ int, s *goquery.Selection) {
            g.Exports <- map[string]interface{}{
                "text":   s.Find("span.text").Text(),
                "author": s.Find("small.author").Text(),
            }
        })
    },
    Exporters: []export.Exporter{&export.JSON{}},
}).Start()
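The pattern behind this is a producer and a consumer joined by a channel: parse functions send items, and an exporter drains the channel and writes them out. The sketch below shows that pattern with encoding/json only. It does not reproduce Geziyor's export package; exportJSONLines and exportToString are hypothetical names, and one JSON object per line is an assumed output format.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// exportJSONLines drains items until the channel is closed and writes each
// one to w as a single line of JSON.
func exportJSONLines(w io.Writer, items <-chan interface{}) error {
	enc := json.NewEncoder(w)
	for item := range items {
		if err := enc.Encode(item); err != nil {
			return err
		}
	}
	return nil
}

// exportToString plays the role of the parse side: it sends every item over
// a channel from a goroutine and returns what the exporter wrote.
func exportToString(items []map[string]interface{}) (string, error) {
	ch := make(chan interface{}, len(items)) // buffered so the sender never blocks
	go func() {
		defer close(ch)
		for _, it := range items {
			ch <- it
		}
	}()
	var buf bytes.Buffer
	err := exportJSONLines(&buf, ch)
	return buf.String(), err
}

func main() {
	out, err := exportToString([]map[string]interface{}{
		{"text": "An example quote.", "author": "Example Author"},
	})
	if err != nil {
		fmt.Println("export failed:", err)
		return
	}
	fmt.Print(out)
}
```

Keeping the writer behind a channel means many parse callbacks can produce items concurrently while a single goroutine owns the output.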

Benchmark

8,748 requests per second on a MacBook Pro 15" (2016)

See tests for this benchmark function:

>> go test -run none -bench Requests -benchtime 10s
goos: darwin
goarch: amd64
pkg: github.com/geziyor/geziyor
BenchmarkRequests-8   	  200000	    108710 ns/op
PASS
ok  	github.com/geziyor/geziyor	22.861s
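The headline figure can be related to this output with simple arithmetic. It appears to be the iteration count divided by the total run time (200000 / 22.861 s), which also counts warm-up runs and setup; the ns/op value alone gives a slightly higher rate. A short sketch of both calculations, with hypothetical function names:

```go
package main

import "fmt"

// perSecondFromTotal divides the iteration count by the total wall-clock time.
func perSecondFromTotal(iterations int, totalSeconds float64) float64 {
	return float64(iterations) / totalSeconds
}

// perSecondFromNsPerOp converts a nanoseconds-per-operation figure to a rate.
func perSecondFromNsPerOp(nsPerOp float64) float64 {
	return 1e9 / nsPerOp
}

func main() {
	fmt.Println(int(perSecondFromTotal(200000, 22.861))) // 8748
	fmt.Println(int(perSecondFromNsPerOp(108710)))       // 9198
}
```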