…andidate @titles, add tests and update corpus
New Pismo::Document#tags method.
Support extracting tags from documents.
Support punctuation in dates.
This covers periods after an abbreviated month or day of the week, and commas before the year.
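A minimal sketch of the kind of normalization this describes, assuming a pre-processing step that strips the punctuation before a strict parse. The helper name `normalize_date_punctuation` and the abbreviation list are hypothetical and are not taken from Pismo's code.

```ruby
require 'date'

# Hypothetical list of abbreviated month and weekday names (illustration only).
ABBREVIATED_NAMES = /\b(Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept?|Oct|Nov|Dec|Mon|Tues?|Wed|Thu(?:rs?)?|Fri|Sat|Sun)\./i

# Hypothetical helper: drops the period after an abbreviated month or
# weekday name and the comma that precedes a four-digit year.
def normalize_date_punctuation(text)
  text.gsub(ABBREVIATED_NAMES, '\1')     # "Jan." becomes "Jan"
      .gsub(/,(?=\s*\d{4}\b)/, '')       # "5, 2010" becomes "5 2010"
end

cleaned = normalize_date_punctuation('Jan. 5, 2010')
parsed  = Date.strptime(cleaned, '%b %d %Y')
```

With the punctuation removed, a fixed `strptime` format such as `'%b %d %Y'` can be applied without a separate variant for each punctuation style.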
The metadata_expected.yaml includes datetimes with a +01:00 time zone, which means the test will fail for anybody running it in a different time zone. Explicitly setting the time zone to UTC in the helper solves this problem.
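One common way to pin the time zone for a Ruby test run is to set the `TZ` environment variable before any times are created. This is a sketch under that assumption; the source does not say which mechanism the helper actually uses.

```ruby
require 'time'

# Assumption: the test helper pins the process time zone through ENV['TZ'].
# This must run before the tests build or compare any Time objects.
ENV['TZ'] = 'UTC'

# Times parsed without an explicit offset are now interpreted as UTC,
# regardless of the machine's system time zone.
local_time = Time.parse('2010-01-05 12:00:00')
offset     = local_time.utc_offset
```

Placing the assignment at the top of the helper means every test file that requires it sees the same zone.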
* Rewrote huge parts of internal_document to be more DRY and produce less garbage
* Integrated the htmlentities gem for generalized HTML entity decoding
* Fixed HTML entity decoding so that it happens when content is extracted, rather than on the source document, which can break parsing
* Stubbed out the network calls in the test suite, resulting in dramatically faster tests
* General garbage, speed, and style tweaks
* Removed trailing whitespace from many files
* Made the ImageExtractor logger customizable; pass false for no logger
* In the same vein, use default options and pass them down to the various pieces of the parser
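The last two items (a customizable logger, and default options passed down) can be illustrated with a short sketch. The class `ExampleExtractor`, its option names, and its default values are hypothetical; this is not Pismo's ImageExtractor, only one plausible shape for the pattern.

```ruby
require 'logger'
require 'stringio'

# Hypothetical class used only to illustrate the pattern.
class ExampleExtractor
  # Hypothetical defaults; callers override only what they need.
  DEFAULT_OPTIONS = { logger: nil, min_width: 100 }.freeze

  attr_reader :options, :logger

  def initialize(options = {})
    @options = DEFAULT_OPTIONS.merge(options)
    @logger =
      case @options[:logger]
      when false then nil                  # false disables logging entirely
      when nil   then Logger.new($stderr)  # no value given: fall back to a default
      else @options[:logger]               # caller-supplied logger
      end
  end

  def log(message)
    logger.info(message) if logger
  end
end

# Usage: a custom logger writing to an in-memory buffer, and a silent extractor.
buffer = StringIO.new
custom = ExampleExtractor.new(logger: Logger.new(buffer), min_width: 200)
custom.log('checking image')

silent = ExampleExtractor.new(logger: false)
silent.log('this goes nowhere')
```

Merging caller options over a frozen defaults hash keeps each component's settings in one place, and the merged hash can be handed unchanged to nested components.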
…ing was enforced, leading to an invalid UTF-8 encoding exception