Web-scraper library for Golang

web-scraper is a small library for parsing and scraping HTML. It is built on top of golang.org/x/net/html.

Installation

A Go version with module support is required. Run the following command in the working directory that contains the go.mod file:

go get github.com/genjik/web-scraper

Documentation

The Element type wraps a pointer to an html.Node. The whole API returns HTML elements as Element values.

type Element struct {
    node *html.Node
}

GetRootElement accepts the HTML as a value of any type that satisfies io.Reader. It returns an Element that points to the <html> node.

func GetRootElement(r io.Reader) (Element, error)
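
Because the input is an io.Reader, the HTML can come from a string, a byte buffer, a file, or a network response body. The sketch below illustrates this with the standard library only. readHTML and sources are hypothetical helpers that stand in for a function taking an io.Reader; they are not part of web-scraper and do no parsing.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// readHTML is a hypothetical stand-in for any function that, like
// GetRootElement, takes its input as an io.Reader. It only reads the
// bytes and returns them; it does not parse HTML.
func readHTML(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// sources wraps the same page in three different io.Reader implementations.
func sources(page string) []io.Reader {
	return []io.Reader{
		strings.NewReader(page),       // in-memory string
		bytes.NewBufferString(page),   // byte buffer
		bytes.NewReader([]byte(page)), // byte slice
	}
}

func main() {
	const page = "<html><body><p>hi</p></body></html>"
	for _, src := range sources(page) {
		got, err := readHTML(src)
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		fmt.Println(got == page) // true
	}
}
```

An *os.File or an http.Response.Body also satisfies io.Reader and could be passed the same way.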

Retrieve raw text from an element

func (e Element) GetText() string

Search for child elements

func (e Element) FindOne(tag string, recursive bool, attrs ...string) Element
func (e Element) FindAll(tag string, recursive bool, limit int, attrs ...string) []Element

Search for parent elements

func (e Element) FindParent(tag string, attrs ...string) Element
func (e Element) FindParents(tag string, limit int, attrs ...string) []Element

Search for sibling elements

func (e Element) FindPrevSibling(tag string, attrs ...string) Element
func (e Element) FindNextSibling(tag string, attrs ...string) Element
func (e Element) FindPrevSiblings(tag string, limit int, attrs ...string) []Element
func (e Element) FindNextSiblings(tag string, limit int, attrs ...string) []Element

Get an element

func (e Element) Parent() Element // Returns parent element
func (e Element) FirstChild() Element // Not supported yet
func (e Element) PrevSibling() Element // Not supported yet
func (e Element) NextSibling() Element // Not supported yet

Parameters:
tag string The tag name of the element, e.g. html, head, body, div, span, h1.

attrs ...string The attributes the method searches for, passed as alternating names and values, e.g. "class", "className". Any number of arguments can be passed, or the parameter can be omitted entirely.

recursive bool "false" tells the method to look only at direct children of the current element. "true" tells the method to search all descendants, down to the last element of the HTML tree.

limit int The maximum number of elements in the result. -1 means no limit.
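
The interaction of these parameters can be sketched on a toy tree. The code below is a simplified model written for illustration, not the library's implementation: node, matches, findAll, and sampleTree are hypothetical names, and it assumes that attrs is read as consecutive name/value pairs and that results come back in document order.

```go
package main

import "fmt"

// node is a toy stand-in for an HTML element; it is NOT the library's Element.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// matches reports whether n has the given tag and every name/value pair in attrs.
// Assumption: a trailing name without a value is ignored.
func matches(n *node, tag string, attrs []string) bool {
	if n.tag != tag {
		return false
	}
	for i := 0; i+1 < len(attrs); i += 2 {
		if v, ok := n.attrs[attrs[i]]; !ok || v != attrs[i+1] {
			return false
		}
	}
	return true
}

// findAll collects matching descendants of n in document order.
// recursive=false inspects direct children only; limit=-1 means no limit.
func findAll(n *node, tag string, recursive bool, limit int, attrs ...string) []*node {
	var out []*node
	var walk func(cur *node)
	walk = func(cur *node) {
		for _, c := range cur.children {
			if limit >= 0 && len(out) >= limit {
				return // enough results collected
			}
			if matches(c, tag, attrs) {
				out = append(out, c)
			}
			if recursive {
				walk(c)
			}
		}
	}
	walk(n)
	return out
}

// sampleTree builds a small body > div > div > div.list-item structure.
func sampleTree() *node {
	item := func(id string) *node {
		return &node{tag: "div", attrs: map[string]string{"class": "list-item", "id": id}}
	}
	red := &node{tag: "div", attrs: map[string]string{"id": "red", "class": "box"},
		children: []*node{{tag: "div", attrs: map[string]string{"id": "special"}}}}
	green := &node{tag: "div", attrs: map[string]string{"id": "green", "class": "box"},
		children: []*node{{tag: "div", children: []*node{item("l1"), item("l2"), item("l3")}}}}
	return &node{tag: "body", children: []*node{red, green}}
}

func main() {
	root := sampleTree()
	fmt.Println(len(findAll(root, "div", false, -1)))                      // 2
	fmt.Println(len(findAll(root, "div", true, -1)))                       // 7
	fmt.Println(len(findAll(root, "div", true, -1, "class", "list-item"))) // 3
	fmt.Println(len(findAll(root, "div", true, 2, "class", "list-item")))  // 2
}
```

With recursive set to false only the two top-level boxes are seen, so a search for "list-item" at that level returns nothing; setting it to true reaches the nested items, and limit caps how many of them are returned.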

Example

package main

import (
    "fmt"
    "strings"

    // The package is referenced as webscraper below, so alias it explicitly.
    webscraper "github.com/genjik/web-scraper"
)

func main() {
    r := strings.NewReader(`
        <html>
            <head></head>
            <body>
                <div id="red" class="box">
                    <div id="special">Special Message</div> 
                </div>

                <div id="green" class="box">
                    <div>
                        <div class="list-item" id="l1">List#1</div>
                        <div class="list-item" id="l2">List#2</div>
                        <div class="list-item" id="l3">List#3</div>
                        <div class="list-item" id="l4">List#4</div>
                        <div class="list-item" id="l5">List#5</div>
                    </div>
                </div>
            </body>
        </html>
    `)

    root, err := webscraper.GetRootElement(r)
    if err != nil {
        // Report the parse error and stop; there is no root element to search.
        fmt.Println("parse error:", err)
        return
    }

    text := root.FindOne("div", true, "id", "special").GetText()
    fmt.Println(text) // Special Message

    elements := root.FindAll("div", true, -1, "class", "list-item") 
    for _, element := range elements {
        fmt.Println(element.GetText()) // List#1-5
    }
}
