refactor evaluation and review tests (#151)
* revamp tests and evaluation

* fix file input

* refactor and simplify evaluation

* update docs

* fix typo

* fix formatting
adbar committed Jun 5, 2024
1 parent 9fb989d commit cf1c1ae
Showing 7 changed files with 303 additions and 344 deletions.
8 changes: 6 additions & 2 deletions docs/evaluation.rst
@@ -2,7 +2,11 @@ Evaluation
==========


Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. In most cases, immediately accessible data on retrieved webpages do not carry substantial or accurate information: neither the URL nor the server response provide a reliable way to date a web document, that is find when it was written or modified. Content extraction mostly draws on Document Object Model (DOM) examination, that is on considering a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Less thorough and not necessarily faster alternatives use superficial search patterns such as regular expressions in order to capture desirable excerpts.
Although text is ubiquitous on the Web, extracting information from web pages can be a difficult task. Easily accessible data often lacks substance or accuracy. Specifically, the URL and server response do not provide a reliable way to determine when a web document was written or last modified.

To overcome this challenge, content extraction typically draws on the Document Object Model (DOM) of an HTML document. This approach treats the document as a tree structure whose nodes represent the parts of the document that can be operated on. While this method is thorough, alternative approaches use superficial search patterns, such as regular expressions, to capture specific text parts. However, these alternatives are not necessarily faster and may be less accurate.
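
A minimal sketch of the two approaches (the HTML snippet and attribute names are purely illustrative, using the lxml library for the DOM part) could look as follows:

.. code-block:: python

    import re
    from lxml.html import fromstring

    html = '<html><head><meta property="article:published_time" content="2024-06-05"/></head></html>'

    # DOM examination: parse the markup into a tree and address a precise node
    tree = fromstring(html)
    dom_result = tree.xpath('//meta[@property="article:published_time"]/@content')

    # superficial alternative: a regular expression over the raw markup
    regex_result = re.search(r"\d{4}-\d{2}-\d{2}", html)

    print(dom_result[0], regex_result.group(0))  # both yield 2024-06-05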

To run the evaluation, see `evaluation README <https://github.com/adbar/htmldate/blob/master/tests/README.rst>`_.


Alternatives
@@ -30,7 +34,7 @@ Description

**Evaluation**: only documents whose dates can be clearly determined are considered for this benchmark. A given day is taken as the unit of reference, meaning that results are converted to ``%Y-%m-%d`` format if necessary in order to make them comparable. The evaluation script is available in the project repository: `tests/comparison.py <https://github.com/adbar/htmldate/blob/master/tests/comparison.py>`_. To reproduce the tests, just clone the repository, install all necessary packages, and run the evaluation script with the data provided in the *tests* directory.
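
For illustration, here is a hedged sketch of such a normalization step (the input format and function name are hypothetical; the actual conversion is handled by the evaluation script):

.. code-block:: python

    from datetime import datetime

    def normalize_date(value, input_format="%d.%m.%Y"):
        """Convert a tool's output to %Y-%m-%d so that results are comparable."""
        return datetime.strptime(value, input_format).strftime("%Y-%m-%d")

    assert normalize_date("05.06.2024") == "2024-06-05"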

**Time**: the execution time (best of 3 tests) cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task.
**Time**: the execution time cannot be easily compared in all cases as some solutions perform a whole series of operations which are irrelevant to this task.

**Errors:** *goose3*'s output is not always meaningful and/or in a standardized format; these cases were discarded. *news-please* seems to have trouble with some encodings (e.g. in Chinese), in which case it raises an exception.

25 changes: 22 additions & 3 deletions tests/README.rst
@@ -1,18 +1,37 @@
Evaluation
==========

This evaluation focuses on the challenge of determining the publication date of a web document. Easily accessible data often lacks substance or accuracy. Specifically, the URL and server response do not provide a reliable way to determine when a web document was written or last modified.


Sources
-------

Principles
^^^^^^^^^^

The benchmark is run on a collection of documents which are either typical of Internet articles (news outlets, blogs, including smaller ones) or non-standard and thus harder to process. For the sake of completeness, documents in multiple languages have been added.

Only documents whose dates can be clearly determined are considered for this benchmark. A single day is taken as the unit of reference (YMD format).

For more information see the `evaluation documentation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_.


Date-annotated HTML pages
^^^^^^^^^^^^^^^^^^^^^^^^^

- BBAW collection (multilingual): Adrien Barbaresi, Shiyang Chen, Lukas Kozmus.
- Additional English news pages: `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University.
- BBAW collection (multilingual with a focus on German): Adrien Barbaresi, Shiyang Chen, Lukas Kozmus.
- Additional news pages (worldwide): `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University.


Reproducing the evaluation
--------------------------

1. Install the packages specified in ``eval-requirements.txt``
2. Run the script ``comparison.py``
2. Run the script ``comparison.py`` (``--help`` for more options)


Hints:

- As different packages are installed, it is recommended to create a virtual environment, for example with ``pyenv`` or ``venv``.
- Some packages are slow; to evaluate ``htmldate`` only, run ``python3 comparison.py --small``. A sketch of a single comparison is given below.
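
In terms of code, a single comparison roughly amounts to the following sketch (the file name and reference date are hypothetical; the full evaluation is handled by ``comparison.py``):

.. code-block:: python

    from htmldate import find_date

    reference = "2024-06-05"  # annotated reference date for the test page

    with open("testpage.html", "r", encoding="utf-8") as f:
        result = find_date(f.read())  # output defaults to %Y-%m-%d

    print(result == reference)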