BoilerText is a Go implementation of the algorithm to remove boilerplate text from HTML files as described by http://www.l3s.de/~kohlschuetter/boilerplate. The paper is found here (PDF). The intent of BoilerText output is for full-text search indexing.
The reference implementation is found in https://github.com/PageDash/boilerpipe (forked from https://github.com/kohlschutter/boilerpipe). This implementation does its best to mimick the algorithm described in the paper, but isn't 100% the same as the boilerpipe
implementation.
By no means idiomatic Go. We'll get there. PRs welcome to clean up stuff or to add new algorithms.
See example usage in https://github.com/PageDash/boilertext/blob/master/main.go
There are two possible split strategies that you will want to consider. For English and English-like languages (which consists of words formed by a sequence of characters), the bufio.ScanWords
SplitFunc
is appropriate. For languages such as Chinese and Japanese (which consists of rune characters), use the bufio.ScanRunes
SplitFunc
to obtain the desired result. Obviously this is a simplistic view, but we gotta start somewhere.
Note that the research algorithm was based on the English language. YMMV for other languages. We found that replacing word split with rune split for runic languages performed decently.
See https://github.com/abadojack/whatlanggo for language detection feature support.
I did a benchmark, and it actually shows that naive string concatenation is faster than bytes.Buffer. And since most HTML is sort of lightweight with text block count in the order of hundreds, string concatenation will be just fine. My results corroborate with https://github.com/hermanschaaf/go-string-concat-benchmarks.