Port improvements from go-htmldate #30
Codecov Report
@@ Coverage Diff @@
## master #30 +/- ##
=======================================
Coverage 92.36% 92.36%
=======================================
Files 7 7
Lines 943 943
=======================================
Hits 871 871
Misses 72 72
Continue to review full report at Codecov.
@@ -386,8 +386,8 @@ def test_try_ymd_date():
 assert try_ymd_date('Am 1. September 2017 um 15:36 Uhr schrieb', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
 assert try_ymd_date('Fri - September 1 - 2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
 assert try_ymd_date('1.9.2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
-assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-01-09' # assuming MDY format
 assert try_ymd_date('201709011234', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
+assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01' # assuming MDY format
That's tricky...
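The ambiguity behind the changed assertion can be reproduced with the standard library alone. The sketch below is only an illustration built on `datetime.strptime`; it does not show how `try_ymd_date` is implemented, and the helper name `parse_slash_date` is hypothetical.

```python
from datetime import datetime


def parse_slash_date(text, day_first):
    """Parse a short slash-separated date; `day_first` selects DMY or MDY."""
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(text, fmt).strftime("%Y-%m-%d")


# The same input yields two different valid dates depending on the convention.
print(parse_slash_date("1/9/17", day_first=True))   # 2017-09-01 (day first)
print(parse_slash_date("1/9/17", day_first=False))  # 2017-01-09 (month first)
```

Both readings are valid calendar dates, so the string alone cannot settle which one is meant; the choice has to come from a convention or from context such as the page language.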
Hi @RadhiFadlillah, thank you very much for the improvements! There is still some work to do on the PR to make it more pythonic, especially the use of external Python modules in Besides, we have a good opportunity with PR #29 to test the changes on a more global dataset, and I would like to do that before merging.
First experiments on the new (combined) dataset:
Current
This PR
→ Small improvement, but also slightly slower.
Hi @RadhiFadlillah, I worked on the PR and used the new benchmark. It seems we still have a bit of work to do. I'm not yet convinced by the removal of the external parsing components; we could test it thoroughly later. If you're OK with the changes, I see two remaining steps:
Results with new evaluation data in 544ce94:
Current
This PR
→ Still a bit better.
Hi @RadhiFadlillah, I'm now merging this PR in its current version, feel free to make further contributions!
Overview
While porting this library to Go, I tried to make some improvements to make the extraction more accurate. After more testing, those improvements look good and stable enough to use, so I decided to port them back to the Python version here.
Changes
There are three main changes in this PR:
Add French and Indonesian to the regular expressions that are used to parse long date strings.
This fixes a case where htmldate failed to extract the date from paris-luttes.info.html, which is written in French. Since I was adding a new language to the regular expressions anyway, I decided to add Indonesian as well.
Improve custom_parse.
It now works by trying to parse the string using several formats, in the following priority:
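The priority list itself is not reproduced here. As a rough sketch of the general technique only (try each format in a fixed order and return the first one that parses), the example below uses a hypothetical function name and a hypothetical set of formats; neither is taken from `custom_parse`.

```python
from datetime import datetime

# Hypothetical priority order, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%d.%m.%Y", "%d/%m/%Y")


def parse_with_priority(text, formats=CANDIDATE_FORMATS):
    """Return the first successful parse as YYYY-MM-DD, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # this format did not match, try the next one
    return None


print(parse_with_priority("1.9.2017"))
print(parse_with_priority("not a date"))
```

Because the first matching format wins, the order of the tuple decides how an ambiguous string is read, which is why the priority matters.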
Merge the XPath selectors from an array of strings into a single string.
This fixes a case where htmldate extracted the wrong date for wolfsrebellen-netz.forumieren.com.regeln.html. Consider an HTML document like this:
In the document above, there are two dates: one in the element with class "author" and the other in the element with class "current-time".
In the original code, htmldate picks the date from the "current-time" element even though it occurs later in the document. This is because DATE_EXPRESSIONS is currently an array of XPath selectors, and in that array, elements whose class contains time are given higher priority than elements whose class contains author.
To fix this, I've converted DATE_EXPRESSIONS and the other XPath selectors from arrays of strings into single strings. This way every rule inside an expression has the same priority, so now the <p class="author"> will be selected first.
Result
Here is the result of the comparison test for the original htmldate:
htmldate fast
htmldate extensive
And here is the result after this PR:
htmldate fast
htmldate extensive
So there is a slight increase in accuracy; however, the extraction becomes slower (around 1.5x slower than the original).
Additional Notes
I've not added it to this PR; however, since custom_parse has been improved, from what I tested we can safely remove external_date_parser without any performance loss. Here is the result of the comparison test after external_date_parser is removed:
htmldate fast
htmldate extensive
So the accuracy is still the same; however, extraction in extensive mode becomes a lot faster (now only 1.08x slower than fast mode), so we might be able to make extensive mode the default. This might need more tests, though.
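As an aside on the third change (merging the XPath selectors), the difference between "first rule wins" and "first element wins" can be imitated with the standard library. This is a sketch under stated assumptions: the sample document is invented (only the two class names come from the description), the helper names are hypothetical, and plain Python predicates stand in for the real XPath expressions. A union expression in XPath yields its matches in document order, which is the behaviour the second function imitates.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the class names mirror the description above.
DOC = """<html><body>
<p class="author">Posted on 2017-09-01</p>
<div class="current-time">2020-01-15</div>
</body></html>"""


def class_contains(token):
    """Build a predicate that checks whether an element's class contains `token`."""
    return lambda el: token in el.get("class", "")


def first_by_rule_priority(root, rules):
    """Array-style lookup: scan the whole tree once per rule, in rule order."""
    for rule in rules:
        for el in root.iter():
            if rule(el):
                return el
    return None


def first_in_document_order(root, rules):
    """Merged-style lookup: a single pass in which any rule may match."""
    for el in root.iter():
        if any(rule(el) for rule in rules):
            return el
    return None


root = ET.fromstring(DOC)
rules = [class_contains("time"), class_contains("author")]
print(first_by_rule_priority(root, rules).get("class"))   # current-time
print(first_in_document_order(root, rules).get("class"))  # author
```

With the rules kept as an ordered array, the earlier rule wins regardless of where its match sits in the page; with a single merged pass, the element that appears first in the document wins.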