Commit
refactor evaluation and review tests (#151)
* revamp tests and evaluation
* fix file input
* refactor and simplify evaluation
* update docs
* fix typo
* fix formatting
Showing 7 changed files with 303 additions and 344 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,37 @@ | ||
Evaluation | ||
========== | ||
|
||
This evaluation focuses on the challenge of determining the publication date of a web document. Easily accessible data often lacks substance or accuracy. Specifically, the URL and server response do not provide a reliable way to determine when a web document was written or last modified. | ||
|
||
|
||
Sources | ||
------- | ||
|
||
Principles | ||
^^^^^^^^^^ | ||
|
||
The benchmark is run on a collection of documents that are either typical of Internet articles (news outlets and blogs, including smaller ones) or non-standard and thus harder to process. For the sake of completeness, documents in multiple languages have been added. | ||
|
||
Only documents whose dates can be clearly determined are considered for this benchmark. A single day is taken as the unit of reference (YMD format). | ||
|
||
For more information see the `evaluation documentation page <https://htmldate.readthedocs.io/en/latest/evaluation.html>`_. | ||
|
||
|
||
Date-annotated HTML pages | ||
^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- BBAW collection (multilingual): Adrien Barbaresi, Shiyang Chen, Lukas Kozmus. | ||
- Additional English news pages: `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University. | ||
- BBAW collection (multilingual with a focus on German): Adrien Barbaresi, Shiyang Chen, Lukas Kozmus. | ||
- Additional news pages (worldwide): `Data Culture Group <https://dataculturegroup.org>`_ at Northeastern University. | ||
|
||
|
||
Reproducing the evaluation | ||
-------------------------- | ||
|
||
1. Install the packages specified in ``eval-requirements.txt`` | ||
2. Run the script ``comparison.py`` | ||
2. Run the script ``comparison.py`` (``--help`` for more options) | ||
|
||
|
||
Hints: | ||
|
||
- As different packages are installed, it is recommended to create a virtual environment, for example with ``pyenv`` or ``venv``. | ||
- Some packages are slow; to evaluate only ``htmldate``, run ``python3 comparison.py --small``. |
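
The Principles section states that a single day (YMD format) is the unit of reference. As a rough illustration of what day-level scoring could look like, here is a minimal sketch. It is not the logic of ``comparison.py``: the function names ``same_day`` and ``accuracy``, the dictionary layout, and the choice to count missing or unparseable predictions as misses are all assumptions made for this example.

```python
from datetime import datetime
from typing import Dict, Optional


def same_day(predicted: Optional[str], reference: str) -> bool:
    """Return True if both strings denote the same calendar day (YYYY-MM-DD).

    Hypothetical helper: a missing or unparseable prediction counts as a miss.
    """
    if predicted is None:
        return False
    try:
        predicted_day = datetime.strptime(predicted, "%Y-%m-%d").date()
    except ValueError:
        return False
    reference_day = datetime.strptime(reference, "%Y-%m-%d").date()
    return predicted_day == reference_day


def accuracy(predictions: Dict[str, Optional[str]], references: Dict[str, str]) -> float:
    """Share of reference documents whose predicted day matches exactly."""
    if not references:
        return 0.0
    hits = sum(
        same_day(predictions.get(doc_id), ref_date)
        for doc_id, ref_date in references.items()
    )
    return hits / len(references)


# Illustrative data only, not taken from the benchmark.
references = {"a": "2021-03-05", "b": "2019-12-31", "c": "2020-01-01", "d": "2018-06-15"}
predictions = {"a": "2021-03-05", "b": "2020-01-01", "c": None}
score = accuracy(predictions, references)
```

With this scoring, a prediction that is off by a single day is treated the same as a missing one, which is one possible reading of taking the day as the unit of reference.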