Commits on Feb 20, 2017
  1. fix tests - in some cases, we are now doing a better job at selecting the top content, and not discarding things like paragraph titles; downside is, sometimes we are left with spurious <span> and <i> elements that are converted to the text "span" and "i" :-/
    quipo committed Feb 20, 2017
  2. go vet;

    quipo committed Feb 20, 2017
Commits on Jan 26, 2017
  1. remove debug code

    quipo committed Jan 26, 2017
  2. Fix double-printing of text nodes; NB: this fix could have been sufficient (PuerkitoBio/goquery#147), but I added a few more cleanup functions
    quipo committed Jan 26, 2017
Commits on Jan 20, 2017
Commits on Jun 22, 2016
Commits on Dec 11, 2015
Commits on Dec 10, 2015
Commits on Dec 9, 2015
  1. some more cleaning

    quipo committed Dec 9, 2015
  2. fix charset handling; reload sites/charset_euc_jp.html in the correct source charset (not already UTF-8)
    quipo committed Dec 9, 2015
Commits on Dec 8, 2015
Commits on Dec 7, 2015
  1. pre-compile regexp

    quipo committed Dec 7, 2015
  2. pass config to extractor

    quipo committed Dec 7, 2015
  3. gofmt

    quipo committed Dec 7, 2015
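
The "pre-compile regexp" commit above refers to a common Go idiom: compile a pattern once at package level instead of on every call. A minimal sketch of that idiom follows; the pattern and the names `multiSpaceRe` and `collapseWhitespace` are hypothetical and are not taken from GoOse.

```go
package main

import (
	"fmt"
	"regexp"
)

// multiSpaceRe is compiled once, when the package is initialised.
// MustCompile panics on an invalid pattern, so a typo fails fast at startup.
// The pattern itself is an illustrative assumption.
var multiSpaceRe = regexp.MustCompile(`\s+`)

// collapseWhitespace replaces every run of whitespace with a single space,
// reusing the pre-compiled pattern on each call.
func collapseWhitespace(s string) string {
	return multiSpaceRe.ReplaceAllString(s, " ")
}

func main() {
	fmt.Printf("%q\n", collapseWhitespace("a  \n\t b"))
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so a single package-level value can be shared across goroutines.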
Commits on Dec 4, 2015
Commits on Dec 2, 2015
  1. move GetCleanTextAndLinks() to ContentExtractor so we do not need to expose the output formatter
    quipo committed Dec 2, 2015
  2. expose a few more methods to make it easier to cherry-pick functionality from GoOse as a library; log unrecognised charsets
    quipo committed Dec 2, 2015
  3. not enough arguments to return

    quipo committed Dec 2, 2015
  4. split preprocessing from meta data extraction, so we can reuse the goquery document without having to do the full metadata extraction unconditionally first
    quipo committed Dec 2, 2015
  5. removed URL helpers, as they were not doing anything in the current implementation. The raw HTML was converted from ISO-8859-1 if not already in UTF-8, even if it was NOT in ISO-8859-1 (then again, the conversion was harmless as the body was never saved back; it only caused a slowdown). If needed, we can implement them again by porting what is done in https://github.com/GravityLabs/goose
    quipo committed Dec 2, 2015
  6. instead of the wrapper Article object, pass *goquery.Document object around directly (simpler code, better extensibility)
    quipo committed Dec 2, 2015
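
The URL-helper commit above describes a body being converted from ISO-8859-1 even when it was not in that charset. As an illustration only, here is one way such a conversion could be guarded, using just the standard library. The names `latin1ToUTF8` and `toUTF8`, and the choice to trust a declared charset string, are assumptions and do not reflect GoOse's or GravityLabs/goose's code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// latin1ToUTF8 maps each ISO-8859-1 byte to the Unicode code point with the
// same numeric value, which is how that charset is defined.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

// toUTF8 converts only when the declared charset is ISO-8859-1 and the body
// is not already valid UTF-8; otherwise the body is returned unchanged.
// The declared charset is assumed to come from a header or meta tag.
func toUTF8(body []byte, declared string) string {
	if strings.EqualFold(declared, "iso-8859-1") && !utf8.Valid(body) {
		return latin1ToUTF8(body)
	}
	return string(body)
}

func main() {
	fmt.Println(toUTF8([]byte("caf\xe9"), "ISO-8859-1")) // Latin-1 input is converted
	fmt.Println(toUTF8([]byte("café"), "utf-8"))         // UTF-8 input is left alone
}
```

Charsets other than ISO-8859-1 need a real transcoding table, which this sketch deliberately leaves out.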
Commits on Dec 1, 2015
  1. replace regexps with simpler string.Split() when cleaning title; instead of the wrapper Article object, pass *goquery.Document object around directly (simpler code, better extensibility)
    quipo committed Dec 1, 2015
  2. make charset utilities public; extract charset detection / encoding to separate method, and also check for meta tags in the format <meta charset="utf-8"> as a fallback
    quipo committed Dec 1, 2015
  3. remove debug code

    quipo committed Dec 1, 2015
  4. Make some extractor methods public so they can be called independently; added charset name cleanup/normalisation code to fix common misspellings; moved charset conversion code to separate file
    quipo committed Dec 1, 2015
  5. golint

    quipo committed Dec 1, 2015
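
Two of the Dec 1 commits above mention checking `<meta charset="utf-8">` as a fallback and normalising misspelled charset names. The sketch below shows one way those two steps could fit together. The regular expressions, the alias table and the names `detectCharset` and `normaliseCharset` are all hypothetical; GoOse's actual patterns and alias list are not shown in this log.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Primary form: <meta http-equiv="Content-Type" content="text/html; charset=...">
	contentTypeRe = regexp.MustCompile(`(?i)<meta[^>]+content=["'][^"'>]*charset=([^"'\s;>]+)`)
	// Fallback form: <meta charset="utf-8">
	// (simplified: assumes charset is the first attribute)
	metaCharsetRe = regexp.MustCompile(`(?i)<meta\s+charset=["']?([^"'\s/>]+)`)
)

// charsetAliases maps a few common misspellings to a canonical name.
// The entries are illustrative assumptions.
var charsetAliases = map[string]string{
	"utf8":      "utf-8",
	"iso8859-1": "iso-8859-1",
	"latin1":    "iso-8859-1",
	"eucjp":     "euc-jp",
}

// normaliseCharset lower-cases the name, turns underscores into hyphens and
// applies the alias table.
func normaliseCharset(name string) string {
	n := strings.ToLower(strings.TrimSpace(name))
	n = strings.ReplaceAll(n, "_", "-")
	if fixed, ok := charsetAliases[n]; ok {
		return fixed
	}
	return n
}

// detectCharset tries the Content-Type form first and the short
// <meta charset> form second. It returns "" when neither is present.
func detectCharset(html string) string {
	for _, re := range []*regexp.Regexp{contentTypeRe, metaCharsetRe} {
		if m := re.FindStringSubmatch(html); m != nil {
			return normaliseCharset(m[1])
		}
	}
	return ""
}

func main() {
	fmt.Println(detectCharset(`<meta http-equiv="Content-Type" content="text/html; charset=EUC_JP">`))
	fmt.Println(detectCharset(`<meta charset="UTF8">`))
}
```

Matching tags with regular expressions is a shortcut that keeps the example short; a stricter version would walk the parsed document instead.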
Commits on Nov 27, 2015
  1. skip invalid byte sequences when encoding HTML to UTF-8, so we can still parse most of the content, instead of dropping all the content after the invalid byte sequences
    quipo committed Nov 27, 2015
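
The Nov 27 commit above describes skipping invalid byte sequences rather than dropping everything after them. A simplified, standard-library sketch of that idea follows. It assumes the input is meant to be UTF-8, whereas the commit concerns conversion from other charsets, and the name `dropInvalidUTF8` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// dropInvalidUTF8 returns a copy of b without the bytes that do not form a
// valid UTF-8 sequence. Everything after an invalid byte is kept.
func dropInvalidUTF8(b []byte) []byte {
	out := make([]byte, 0, len(b))
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Invalid byte: skip it and carry on with the rest of the input.
			b = b[1:]
			continue
		}
		out = append(out, b[:size]...)
		b = b[size:]
	}
	return out
}

func main() {
	raw := []byte("<p>caf\xff\xc3\xa9 ok</p>") // \xff is not valid UTF-8
	fmt.Printf("%s\n", dropInvalidUTF8(raw))
}
```

Since Go 1.13, `bytes.ToValidUTF8(b, nil)` gives the same result for this case; the explicit loop is shown only to make the skipping step visible.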