Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port improvements from go-htmldate #30

Merged
merged 11 commits into from Aug 13, 2021
Merged

Port improvements from go-htmldate #30

merged 11 commits into from Aug 13, 2021

Conversation

RadhiFadlillah
Copy link
Contributor

@RadhiFadlillah RadhiFadlillah commented Jul 13, 2021

Overview

While porting this library into Go language, I've tried to made some improvements to make the extraction more accurate. After more testing, it looks like those improvements are good and stable enough to use so I decided to implement those improvements back to Python here.

Changes

There are three main changes in this PR:

  1. Add French and Indonesian language to regular expressions that used to parse long date string.

    This is done to fix htmldate failed to extract date from paris-luttes.info.html which uses French language. Since I added a new language to the regular expressions, I decided to add Indonesian language as well.

  2. Improve custom_parse.

    Now it works by trying to parse the string using several formats with following priority:

    • YYYYMMDD pattern
    • YYYY-MM-DD (ISO-8601)
    • DD-MM-YYYY (most common used date format according to Wikipedia)
    • YYYY-MM pattern
    • Regex patterns
  3. Merge xpath selectors from array of strings into a single string.

    This is done to fix htmldate extracted the wrong date for wolfsrebellen-netz.forumieren.com.regeln.html. Consider HTML document like this:

    <div>
    	<h1>Some Title</h1>
    	<p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
    	<p>Lorem ipsum dolor sit amet.</p>
    	<p>Dolorum explicabo quos voluptas voluptates?</p>
    	<p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
    </div>

    In document above, there are two dates: one in element with class "author" and the other in element with class "current-time".

    In the original code, htmldate will pick the date from element in "current-time" even though it's occured later in the document. This is because currently DATE_EXPRESSIONS is created as array of Xpath selectors, and in that array element with classes that contains time is given more priority than element with classes that contains author.

    To fix this, I've converted DATE_EXPRESSIONS and other Xpath selectors from array of strings into a single string. This way every rules inside the expressions has same priority, so now the <p class="author"> will be selected first.

Result

Here is the result of comparison test for the original htmldate:

Package Precision Recall Accuracy F-Score Speed (s)
htmldate fast 0.899 0.917 0.831 0.908 1.038
htmldate extensive 0.893 1.000 0.893 0.944 2.219

And here is after this PR:

Package Precision Recall Accuracy F-Score Speed (s)
htmldate fast 0.920 0.938 0.867 0.929 1.579
htmldate extensive 0.911 1.000 0.911 0.953 2.807

So there is a slight increase in accuracy, however the extraction speed become slower (around 1.5x slower than the original).

Additional Notes

I've not added it to this PR, however since custom_parse has been improved, from what I test we can safely remove external_date_parser without any performance loss. Here is the result of comparison test after external_date_parser removed:

Package Precision Recall Accuracy F-Score Speed (s)
htmldate fast 0.920 0.938 0.867 0.929 1.678
htmldate extensive 0.911 1.000 0.911 0.953 1.816

So the accuracy is still the same, however the extraction speed for extensive mode become a lot faster (now only 1.08x slower than the fast mode) so we might be able to make the extensive mode as default. Might need more tests though.

@codecov-commenter
Copy link

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #30 (544ce94) into master (d6d34d3) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 544ce94 differs from pull request most recent head 60bd2bc. Consider uploading reports for the commit 60bd2bc to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master      #30   +/-   ##
=======================================
  Coverage   92.36%   92.36%           
=======================================
  Files           7        7           
  Lines         943      943           
=======================================
  Hits          871      871           
  Misses         72       72           
Impacted Files Coverage Δ
htmldate/core.py 95.75% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6d34d3...60bd2bc. Read the comment docs.

@RadhiFadlillah RadhiFadlillah marked this pull request as ready for review July 14, 2021 03:22
htmldate/extractors.py Outdated Show resolved Hide resolved
htmldate/extractors.py Outdated Show resolved Hide resolved
htmldate/extractors.py Outdated Show resolved Hide resolved
htmldate/extractors.py Outdated Show resolved Hide resolved
htmldate/settings.py Outdated Show resolved Hide resolved
@@ -386,8 +386,8 @@ def test_try_ymd_date():
assert try_ymd_date('Am 1. September 2017 um 15:36 Uhr schrieb', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
assert try_ymd_date('Fri - September 1 - 2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
assert try_ymd_date('1.9.2017', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-01-09' # assuming MDY format
assert try_ymd_date('201709011234', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01'
assert try_ymd_date('1/9/17', OUTPUTFORMAT, True, MIN_DATE, LATEST_POSSIBLE) == '2017-09-01' # assuming MDY format
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's tricky...

@adbar
Copy link
Owner

adbar commented Jul 15, 2021

Hi @RadhiFadlillah, thank you very much for the improvements!

There is still some work to do on the PR to make it more pythonic, especially the use of external Python modules in custom_parse should stay as it is IMHO, mostly for speed reasons. 1.5x slower is not huge but it's still significant.

Besides, we have a good opportunity with PR #29 to test the changes on a more global dataset, I would want to do it before merging.

@adbar
Copy link
Owner

adbar commented Aug 3, 2021

First experiments on the new (combined dataset):

Current

Package Precision Recall Accuracy F-score Time (wrt htmldate fast)
htmldate extensive 0.831 0.989 0.824 0.903226 2.03x
htmldate fast 0.838 0.903 0.769 0.869681 1.00x

This PR

Package Precision Recall Accuracy F-score Time (wrt htmldate fast)
htmldate extensive 0.836 0.989 0.828 0.906049 1.78x
htmldate fast 0.851 0.905 0.781 0.877147 1.00x

→ Small improvement, but also slightly slower.

@adbar
Copy link
Owner

adbar commented Aug 4, 2021

Hi @RadhiFadlillah, I worked on the PR and used the new benchmark. It seems we still have a bit work to do.

I'm not convinced by the removal of external parsing components yet, we could test it thoroughly later.

If you're OK with the changes I see two remaining steps:

  • The numbers in the Readme need to be adapted to the new benchmark
  • Smaller parts of the code are untested (see codecov diff)

@adbar
Copy link
Owner

adbar commented Aug 5, 2021

Results with new evaluation data in 544ce94:

Current

Package Precision Recall Accuracy F-score Time (wrt fast mode)
htmldate extensive 0.819 0.990 0.812 0.896247 1.80x
htmldate fast 0.826 0.905 0.760 0.863636 1.00x

This PR

Package Precision Recall Accuracy F-score Time (wrt fast mode)
htmldate extensive 0.825 0.990 0.818 0.89989 1.59x
htmldate fast 0.839 0.906 0.772 0.871332 1.00x

→ Still a bit better.

@adbar
Copy link
Owner

adbar commented Aug 13, 2021

Hi @RadhiFadlillah, I'm now merging this PR in its current version, feel free to make further contributions!

@adbar adbar merged commit e325529 into adbar:master Aug 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants