eval: review code, add guidelines and small benchmark (#542)
* eval: review code, add guidelines and small benchmark
* edit readme
* change file name in CONTRIBUTING.md
Showing 5 changed files with 478 additions and 81 deletions.
@@ -1,24 +1,63 @@
Evaluation
==========

Introduction
^^^^^^^^^^^^

Focus
-----

The multilingual evaluation covers a wide range of websites: news outlets, online magazines, blogs, and government or company pages. Archived versions of the pages are sometimes used to test whether the extraction is consistent over time.

The benchmark focuses on decisive text segments, mostly at the beginning and the end of the main text, where errors often occur. Other difficult segments throughout the document are chosen to improve the detection of false positives, and segments from particular sections (e.g. quotes or lists) are included to check whether all necessary parts of a document are present in the output.
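
As a rough illustration of this idea, the sketch below checks which expected segments are found in an extracted text and which unwanted segments slipped through, then derives precision and recall. The names ``score_output`` and ``precision_recall``, the segment lists and the sample strings are hypothetical and are not taken from the evaluation scripts.

```python
def score_output(output, expected, unwanted):
    """Count hits and misses for one extracted text (illustrative only)."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    # Expected segments: present is a true positive, absent a false negative.
    for segment in expected:
        counts["tp" if segment in output else "fn"] += 1
    # Unwanted segments: present is a false positive, absent a true negative.
    for segment in unwanted:
        counts["fp" if segment in output else "tn"] += 1
    return counts


def precision_recall(counts):
    """Derive precision and recall from the counts, guarding against zero division."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical sample data, for illustration only.
sample_output = "First sentence of the article. A quoted passage. Last sentence."
counts = score_output(
    sample_output,
    expected=["First sentence", "A quoted passage", "Closing list item"],
    unwanted=["Subscribe to our newsletter"],
)
precision, recall = precision_recall(counts)
```

Plain substring matching keeps the annotation effort low: only a handful of segments per document have to be marked, at the cost of ignoring order and repetition.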


Caveats
-------

This type of evaluation does not check for duplicate segments, although Trafilatura features an LRU cache for the detection of duplicate text parts.
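
To make the notion of an LRU cache for duplicate detection more concrete, here is a minimal sketch built on ``collections.OrderedDict``. It is not Trafilatura's implementation: the class ``SeenSegments``, its ``maxsize`` parameter and the sample strings are assumptions chosen for illustration.

```python
from collections import OrderedDict


class SeenSegments:
    """Bounded store of recently seen text segments (hypothetical helper)."""

    def __init__(self, maxsize=3):
        self.maxsize = maxsize
        self._counts = OrderedDict()

    def is_duplicate(self, segment):
        """Record the segment and report whether it was already stored."""
        key = segment.strip()
        if key in self._counts:
            self._counts[key] += 1
            self._counts.move_to_end(key)  # mark as most recently used
            return True
        self._counts[key] = 1
        if len(self._counts) > self.maxsize:
            self._counts.popitem(last=False)  # evict the least recently used entry
        return False


# Hypothetical sequence of segments, for illustration only.
cache = SeenSegments(maxsize=2)
segments = ["Cookie notice", "Paragraph A", "Cookie notice", "Paragraph B", "Paragraph A"]
flags = [cache.is_duplicate(s) for s in segments]
```

Because the store is bounded, a segment that was evicted in the meantime is treated as new again, which is the usual trade-off between memory use and detection coverage.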

Whether the extracted segments appear in the right order is not evaluated, although the segments are generally few and far apart.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents. More comprehensive evaluations are available, mostly focusing on English and/or a particular text type.


Running the code
^^^^^^^^^^^^^^^^

The results and a list of comparable benchmarks are available on the `evaluation page of the docs <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_.


Trafilatura evaluation
----------------------

The following steps make it possible to compare changes made to Trafilatura, for example in a new version or a pull request:

1. Install Trafilatura
2. Run the script ``comparison_small.py``


Full evaluation
---------------

A comparison with similar software is run periodically. As the packages tend to evolve, the script may not always be up to date and not all packages may be available. If that happens, commenting out the corresponding sections is the most efficient solution. Fixes to the file can be submitted as pull requests.


1. Install the packages specified in ``eval-requirements.txt``
2. Run the script ``comparison.py`` (some packages are slow, so it can take a while)
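
When a package is missing, one possible alternative to commenting out sections by hand is to check which packages can be imported and skip the rest. The sketch below only illustrates that idea and does not describe how ``comparison.py`` works; the function ``available_packages`` and the names in ``candidates`` are placeholders.

```python
import importlib.util


def available_packages(candidates):
    """Return the candidates that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


# Placeholder names: "json" is in the standard library, the second one is assumed absent.
candidates = ["json", "package_that_is_not_installed"]
usable = available_packages(candidates)
skipped = [name for name in candidates if name not in usable]
```

The comparison could then loop over ``usable`` only and report ``skipped`` so that missing tools are visible in the results.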


Sources
^^^^^^^

Annotated HTML documents
------------------------

- BBAW collection (multilingual): Adrien Barbaresi, Lukas Kozmus, Lena Klink.
- Polish news: `tsolewski <https://github.com/tsolewski/Text_extraction_comparison_PL>`_.

HTML archives
-------------

- Additional German news sites: diskursmonitor.de, courtesy of Jan Oliver Rüdiger.