Port improvements from go-htmldate #30

RadhiFadlillah · 2021-07-13T15:47:29Z

Overview

While porting this library into Go language, I've tried to made some improvements to make the extraction more accurate. After more testing, it looks like those improvements are good and stable enough to use so I decided to implement those improvements back to Python here.

Changes

There are three main changes in this PR:

Add French and Indonesian language to regular expressions that used to parse long date string.

This is done to fix htmldate failed to extract date from paris-luttes.info.html which uses French language. Since I added a new language to the regular expressions, I decided to add Indonesian language as well.
Improve custom_parse.

Now it works by trying to parse the string using several formats with following priority:
- YYYYMMDD pattern
- YYYY-MM-DD (ISO-8601)
- DD-MM-YYYY (most common used date format according to Wikipedia)
- YYYY-MM pattern
- Regex patterns
Merge xpath selectors from array of strings into a single string.

This is done to fix htmldate extracted the wrong date for wolfsrebellen-netz.forumieren.com.regeln.html. Consider HTML document like this:
```
<div>
	<h1>Some Title</h1>
	<p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
	<p>Lorem ipsum dolor sit amet.</p>
	<p>Dolorum explicabo quos voluptas voluptates?</p>
	<p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
</div>
```
In document above, there are two dates: one in element with class "author" and the other in element with class "current-time".

In the original code, htmldate will pick the date from element in "current-time" even though it's occured later in the document. This is because currently DATE_EXPRESSIONS is created as array of Xpath selectors, and in that array element with classes that contains time is given more priority than element with classes that contains author.

To fix this, I've converted DATE_EXPRESSIONS and other Xpath selectors from array of strings into a single string. This way every rules inside the expressions has same priority, so now the <p class="author"> will be selected first.

Result

Here is the result of comparison test for the original htmldate:

Package	Precision	Recall	Accuracy	F-Score	Speed (s)
`htmldate` fast	0.899	0.917	0.831	0.908	1.038
`htmldate` extensive	0.893	1.000	0.893	0.944	2.219

And here is after this PR:

Package	Precision	Recall	Accuracy	F-Score	Speed (s)
`htmldate` fast	0.920	0.938	0.867	0.929	1.579
`htmldate` extensive	0.911	1.000	0.911	0.953	2.807

So there is a slight increase in accuracy, however the extraction speed become slower (around 1.5x slower than the original).

Additional Notes

I've not added it to this PR, however since custom_parse has been improved, from what I test we can safely remove external_date_parser without any performance loss. Here is the result of comparison test after external_date_parser removed:

Package	Precision	Recall	Accuracy	F-Score	Speed (s)
`htmldate` fast	0.920	0.938	0.867	0.929	1.678
`htmldate` extensive	0.911	1.000	0.911	0.953	1.816

So the accuracy is still the same, however the extraction speed for extensive mode become a lot faster (now only 1.08x slower than the fast mode) so we might be able to make the extensive mode as default. Might need more tests though.

codecov-commenter · 2021-07-14T01:44:54Z

Codecov Report

Merging #30 (544ce94) into master (d6d34d3) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 544ce94 differs from pull request most recent head 60bd2bc. Consider uploading reports for the commit 60bd2bc to get more accurate results

@@           Coverage Diff           @@
##           master      #30   +/-   ##
=======================================
  Coverage   92.36%   92.36%           
=======================================
  Files           7        7           
  Lines         943      943           
=======================================
  Hits          871      871           
  Misses         72       72

Impacted Files	Coverage Δ
htmldate/core.py	`95.75% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6d34d3...60bd2bc. Read the comment docs.

htmldate/extractors.py

htmldate/settings.py

adbar · 2021-07-15T13:46:15Z

tests/unit_tests.py

@@ -386,8 +386,8 @@ def test_try_ymd_date():
    assert try_ymd_date('Am 1. September 2017 um 15:36 Uhr schrieb', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
    assert try_ymd_date('Fri - September 1 - 2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
    assert try_ymd_date('1.9.2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
-    assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-01-09' # assuming MDY format
-    assert try_ymd_date('201709011234', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
+    assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01' # assuming MDY format


That's tricky...

adbar · 2021-07-15T13:50:54Z

Hi @RadhiFadlillah, thank you very much for the improvements!

There is still some work to do on the PR to make it more pythonic, especially the use of external Python modules in custom_parse should stay as it is IMHO, mostly for speed reasons. 1.5x slower is not huge but it's still significant.

Besides, we have a good opportunity with PR #29 to test the changes on a more global dataset, I would want to do it before merging.

adbar · 2021-08-03T17:52:33Z

First experiments on the new (combined dataset):

Current

Package	Precision	Recall	Accuracy	F-score	Time (wrt htmldate fast)
htmldate extensive	0.831	0.989	0.824	0.903226	2.03x
htmldate fast	0.838	0.903	0.769	0.869681	1.00x

This PR

Package	Precision	Recall	Accuracy	F-score	Time (wrt htmldate fast)
htmldate extensive	0.836	0.989	0.828	0.906049	1.78x
htmldate fast	0.851	0.905	0.781	0.877147	1.00x

→ Small improvement, but also slightly slower.

adbar · 2021-08-04T16:55:32Z

Hi @RadhiFadlillah, I worked on the PR and used the new benchmark. It seems we still have a bit work to do.

I'm not convinced by the removal of external parsing components yet, we could test it thoroughly later.

If you're OK with the changes I see two remaining steps:

The numbers in the Readme need to be adapted to the new benchmark
Smaller parts of the code are untested (see codecov diff)

adbar · 2021-08-05T15:53:20Z

Results with new evaluation data in 544ce94:

Current

Package	Precision	Recall	Accuracy	F-score	Time (wrt fast mode)
htmldate extensive	0.819	0.990	0.812	0.896247	1.80x
htmldate fast	0.826	0.905	0.760	0.863636	1.00x

This PR

Package	Precision	Recall	Accuracy	F-score	Time (wrt fast mode)
htmldate extensive	0.825	0.990	0.818	0.89989	1.59x
htmldate fast	0.839	0.906	0.772	0.871332	1.00x

→ Still a bit better.

adbar · 2021-08-13T12:25:44Z

Hi @RadhiFadlillah, I'm now merging this PR in its current version, feel free to make further contributions!

RadhiFadlillah added 7 commits July 13, 2021 13:37

Add French and Indonesian language to regex

ed93871

Improve custom date parser

a2ddbf5

Remove unneeded(?) regex

9ca4fbf

Merge xpath expressions

5ec87d5

Fix wrong xpath query and increase n candidate limit

87562ac

Update readme

99c75ce

Convert tabs to spaces for consistency

1fcbc59

RadhiFadlillah marked this pull request as ready for review July 14, 2021 03:22