HTML Content / Article Extractor in Golang
advancedlogic Merge pull request #57 from rguimaraens/rguimaraens/newer-set-signature-and-typing

Updates signature and typing for fatih/set
Latest commit 0101480 Aug 10, 2018
sites more test; added "brass-rail" to the list of tags to clean Jan 20, 2017
.gitignore DEV-4442 ignore target dir, *.test and *.prof files Nov 4, 2015
.travis.yml improved coverage report and add support for travis-ci and coveralls Nov 22, 2015
Gopkg.lock [FIX] Image extraction Jul 29, 2017
Gopkg.toml [FIX] Image extraction Jul 29, 2017
LICENSE Initial commit Jan 10, 2014
Makefile New reports to improve code quality May 1, 2017
README.md Update README.md May 18, 2017
VERSION New reports to improve code quality May 1, 2017
article.go First pass at extraction of article date published field. Mar 18, 2018
charset.go use log.Println instead of fmt.Println Jun 6, 2017
charset_test.go make charset utilities public; extract charset detection / encoding t… Dec 1, 2015
cleaner.go [FIX] issue with convertDivsToParagraphs Dec 4, 2017
configuration.go First pass at extraction of article date published field. Mar 18, 2018
coverage.sh coverage script Nov 24, 2015
crawler.go Merge pull request #53 from jaytaylor/jay/crawl-error-handling Mar 24, 2018
crawler_test.go fix tests - in some cases, we are now doing a better job at selecting… Feb 20, 2017
doc.go Fix error reported by static analyzers May 1, 2017
extractor.go Updates signature and typing for fatih/set Aug 8, 2018
goose.go do not panic, return error instead Dec 7, 2015
goose.json Create goose.json Jan 11, 2014
images.go Update images scoring rules, and added a set of rules for classes May 20, 2018
outputformatter.go Fix error reported by static analyzers May 1, 2017
parser.go Fix double-printing of text nodes; NB: this fix could have been suffi… Jan 26, 2017
stopwords.go Updates signature and typing for fatih/set Aug 8, 2018
videos.go Updates signature and typing for fatih/set Aug 8, 2018
wordstats.go golint Sep 27, 2015

README.md

GoOse

HTML Content / Article Extractor in Golang


Description

This is a Golang port of "Goose", originally licensed to Gravity.com under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.

The Golang port was written by Antonio Linari.

Gravity.com licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

INSTALL

go get github.com/advancedlogic/GoOse

HOW TO USE IT

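package main

import (
	"fmt"
	"log"

	// The package name is goose; the alias makes that explicit.
	goose "github.com/advancedlogic/GoOse"
)

func main() {
	g := goose.New()
	// ExtractFromURL fetches the page and extracts the article from it.
	article, err := g.ExtractFromURL("http://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		// Stop here: the article is not usable if extraction failed.
		log.Fatal(err)
	}
	fmt.Println("title", article.Title)
	fmt.Println("description", article.MetaDescription)
	fmt.Println("keywords", article.MetaKeywords)
	fmt.Println("content", article.CleanedText)
	fmt.Println("url", article.FinalURL)
	fmt.Println("top image", article.TopImage)
}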

Development - Getting started

This application is written in Go; please refer to the guides at https://golang.org to get started.

This project includes a Makefile that lets you test and build the project with simple commands. To see all available options:

make help

Before committing code, please check that it passes all tests by running:

make deps
make qa

TODO

  • better organize code
  • improve "xpath" like queries
  • add other image extraction techniques (ImageMagick)

THANKS TO

@Martin Angers for goquery
@Fatih Arslan for set
GoLang team for the amazing language and net/html