Skip to content

akeil/scrapen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scrapen

Scrapen is a tool to download article content from web pages.

Features include:

  • following redirects
  • extract the main content (article) from the web site
  • clean up the resulting HTML
  • download referenced images
  • extract additional metadata

Status

Early development, but should be usable.

Usage

package main

import (
    "fmt"
    "github.com/akeil/scrapen"
)

func main() {
    url := "https://golang.org/doc/effective_go"

    article, err := scrapen.Scrape(url, nil)
    if err != nil {
        fmt.Printf("Scrape failed: %v\n", err)
        return
    }

    fmt.Printf("Title: %v\n", article.Title)
    fmt.Printf("Site: %v\n", article.Site)
    fmt.Printf("Final URL: %v\n", article.ActualURL)
    fmt.Println("HTML:")
    fmt.Println(article.HTML)
}

The behavior of a scraping task can be controlled with the Options struct:

o := &scrapen.Options{
    Metadata:       true,
    Readability:    true,
    Clean:          true,
    DownloadImages: true,
    Store:          MyStoreImplementation(),
}
article, err := scrapen.Scrape(url, o)

CLI

A small command line tool is included.

Usage

$ scrapen https://golang.org/doc/effective_go

Will write the resulting HTML page to a local file ./output.html.

About

Scrapes articles from websites

Resources

Stars

Watchers

Forks

Packages

No packages published