docs: update evaluation and readme
adbar committed Jul 22, 2021
1 parent 72a0ddd commit e5ef7e0
Showing 3 changed files with 47 additions and 19 deletions.
21 changes: 10 additions & 11 deletions README.rst
@@ -69,39 +69,38 @@ Evaluation and alternatives
For more detailed results see the `evaluation page <https://github.com/adbar/trafilatura/blob/master/docs/evaluation.rst>`_ and `evaluation script <https://github.com/adbar/trafilatura/blob/master/tests/comparison.py>`_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory.

=============================== ========= ========== ========= ========= ======
500 documents, 1487 text and 1496 boilerplate segments (2020-11-06)
500 documents, 1487 text and 1496 boilerplate segments (2021-06-07)
--------------------------------------------------------------------------------
Python Package Precision Recall Accuracy F-Score Diff.
=============================== ========= ========== ========= ========= ======
justext 2.2.0 (tweaked) 0.870 0.584 0.749 0.699 6.1x
justext 2.2.0 (custom) 0.870 0.584 0.749 0.699 6.1x
newspaper3k 0.2.8 0.921 0.574 0.763 0.708 12.9x
goose3 3.1.6 **0.950** 0.629 0.799 0.757 19.0x
boilerpy3 1.0.2 (article mode) 0.851 0.696 0.788 0.766 4.8x
goose3 3.1.9 **0.950** 0.644 0.806 0.767 18.8x
*baseline (text markup)* 0.746 0.804 0.766 0.774 **1x**
dragnet 2.0.4 0.906 0.689 0.810 0.783 3.1x
readability-lxml 0.8.1 0.917 0.716 0.826 0.804 5.9x
news-please 1.5.13 0.923 0.711 0.827 0.804 184x
trafilatura 0.6.0 0.924 0.849 0.890 0.885 3.9x
trafilatura 0.6.0 (+ fallbacks) 0.933 **0.877** **0.907** **0.904** 8.4x
news-please 1.5.21 0.924 0.718 0.830 0.808 60x
trafilatura 0.8.2 (fast) 0.925 0.868 0.899 0.896 3.9x
trafilatura 0.8.2 0.934 **0.890** **0.914** **0.912** 8.4x
=============================== ========= ========== ========= ========= ======

**External evaluations:**

- Most efficient open-source library in *ScrapingHub*'s `article extraction benchmark <https://github.com/scrapinghub/article-extraction-benchmark>`_ as well as in `another independent evaluation on the same data <https://github.com/currentsapi/extractnet>`_.
- Best overall tool according to Gaël Lejeune & Adrien Barbaresi, `Bien choisir son outil d'extraction de contenu à partir du Web <https://hal.archives-ouvertes.fr/hal-02768510v3/document>`_ (2020, PDF, French).
- `Evaluation page <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_ in the docs.


Usage and documentation
-----------------------

For further information please refer to the `documentation <https://trafilatura.readthedocs.io>`_:
For more information please refer to the docs:

- `Installation <https://trafilatura.readthedocs.io/en/latest/installation.html>`_
- Usage: `On the command-line <https://trafilatura.readthedocs.io/en/latest/usage-cli.html>`_, `With Python <https://trafilatura.readthedocs.io/en/latest/usage-python.html>`_, `With R <https://trafilatura.readthedocs.io/en/latest/usage-r.html>`_
- `Core Python functions <https://trafilatura.readthedocs.io/en/latest/corefunctions.html>`_
- Python Notebook `Trafilatura Overview <Trafilatura_Overview.ipynb>`_
- `Tutorials <https://trafilatura.readthedocs.io/en/latest/tutorials.html>`_
- `Evaluation <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_

For video tutorials see this YouTube playlist:

@@ -119,9 +118,9 @@ See also `GPL and free software licensing: What's in it for business? <https://w
Contributing
~~~~~~~~~~~~

`Contributions <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ are welcome!
`Contributions <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ are welcome! Please also feel free to file issues on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_.

Feel free to file issues on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_. Thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!
Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!


Author
38 changes: 32 additions & 6 deletions docs/evaluation.rst
@@ -57,7 +57,7 @@ Description
The evaluation script is available on the project repository: `tests/comparison.py <https://github.com/adbar/trafilatura/blob/master/tests/comparison.py>`_. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the *tests* directory.


Results (2020-11-06)
Results (2021-06-07)
--------------------

=============================== ========= ========== ========= ========= ======
@@ -69,16 +69,16 @@ Python Package Precision Recall Accuracy F-Score Diff.
html2text 2020.1.16 0.488 0.714 0.484 0.580 8.9x
html_text 0.5.2 0.526 **0.958** 0.548 0.679 1.9x
inscriptis 1.1 (html to txt) 0.531 **0.958** 0.556 0.683 2.4x
justext 2.2.0 (tweaked) 0.870 0.584 0.749 0.699 6.1x
justext 2.2.0 (custom) 0.870 0.584 0.749 0.699 6.1x
newspaper3k 0.2.8 0.921 0.574 0.763 0.708 12.9x
goose3 3.1.6 **0.950** 0.629 0.799 0.757 19.0x
boilerpy3 1.0.2 (article mode) 0.851 0.696 0.788 0.766 4.8x
goose3 3.1.9 **0.950** 0.644 0.806 0.767 18.8x
*baseline (text markup)* 0.746 0.804 0.766 0.774 **1x**
dragnet 2.0.4 0.906 0.689 0.810 0.783 3.1x
readability-lxml 0.8.1 0.917 0.716 0.826 0.804 5.9x
news-please 1.5.13 0.923 0.711 0.827 0.804 184x
trafilatura 0.6.0 0.924 0.849 0.890 0.885 3.9x
trafilatura 0.6.0 (+ fallbacks) 0.933 **0.877** **0.907** **0.904** 8.4x
news-please 1.5.21 0.924 0.718 0.830 0.808 60x
trafilatura 0.8.2 (fast) 0.925 0.868 0.899 0.896 3.9x
trafilatura 0.8.2 0.934 **0.890** **0.914** **0.912** 8.4x
=============================== ========= ========== ========= ========= ======
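
The F-Score and Accuracy columns can be related back to Precision and Recall. The sketch below is a minimal illustration, not code from ``tests/comparison.py``: the helper names ``f_score`` and ``accuracy`` are hypothetical, and it assumes that the F-Score is the usual harmonic mean (F1) and that the 1487 text segments form the positive class and the 1496 boilerplate segments the negative class.

```python
# Hypothetical helpers for illustration; not taken from tests/comparison.py.

def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(precision: float, recall: float, n_text: int, n_boilerplate: int) -> float:
    """Accuracy implied by precision and recall.

    Assumes text segments are the positive class and boilerplate
    segments the negative class.
    """
    if precision <= 0:
        raise ValueError("precision must be positive")
    true_pos = recall * n_text                          # text segments kept
    false_pos = true_pos * (1 - precision) / precision  # boilerplate kept by mistake
    true_neg = n_boilerplate - false_pos                # boilerplate correctly dropped
    return (true_pos + true_neg) / (n_text + n_boilerplate)


# Baseline row: precision 0.746, recall 0.804,
# on 1487 text and 1496 boilerplate segments.
print(round(f_score(0.746, 0.804), 3))
print(round(accuracy(0.746, 0.804, 1487, 1496), 3))
```

For the baseline row these values round to 0.774 and 0.766, in line with the table. Other rows can differ in the last digit, because the published precision and recall figures are themselves rounded.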


@@ -91,6 +91,32 @@ External evaluations



Older results (2020-11-06)
--------------------------

=============================== =========  ========== ========= ========= ======
500 documents, 1487 text and 1496 boilerplate segments
--------------------------------------------------------------------------------
Python Package                  Precision  Recall     Accuracy  F-Score   Diff.
=============================== =========  ========== ========= ========= ======
*raw HTML*                      0.527      0.878      0.547     0.659     0
html2text 2020.1.16             0.488      0.714      0.484     0.580     8.9x
html_text 0.5.2                 0.526      **0.958**  0.548     0.679     1.9x
inscriptis 1.1 (html to txt)    0.531      **0.958**  0.556     0.683     2.4x
justext 2.2.0 (tweaked)         0.870      0.584      0.749     0.699     6.1x
newspaper3k 0.2.8               0.921      0.574      0.763     0.708     12.9x
goose3 3.1.6                    **0.950**  0.629      0.799     0.757     19.0x
boilerpy3 1.0.2 (article mode)  0.851      0.696      0.788     0.766     4.8x
*baseline (text markup)*        0.746      0.804      0.766     0.774     **1x**
dragnet 2.0.4                   0.906      0.689      0.810     0.783     3.1x
readability-lxml 0.8.1          0.917      0.716      0.826     0.804     5.9x
news-please 1.5.13              0.923      0.711      0.827     0.804     184x
trafilatura 0.6.0               0.924      0.849      0.890     0.885     3.9x
trafilatura 0.6.0 (+ fallbacks) 0.933      **0.877**  **0.907** **0.904** 8.4x
=============================== =========  ========== ========= ========= ======



Older results (2020-07-16)
--------------------------

7 changes: 5 additions & 2 deletions docs/index.rst
@@ -136,6 +136,8 @@ Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/t
Roadmap
~~~~~~~

For a comprehensive list see the `issues page <https://github.com/adbar/trafilatura/issues>`_.

- [-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
- [-] URL lists and document management
- [-] Configuration and extraction parameters
@@ -147,9 +149,9 @@ Roadmap
Contributing
~~~~~~~~~~~~

`Contributions <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ are welcome!
`Contributions <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ are welcome! Please also feel free to file issues on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_.

Feel free to file issues on the `dedicated page <https://github.com/adbar/trafilatura/issues>`_. Thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!
Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who submitted features and bugfixes!


Author
@@ -160,6 +162,7 @@ This effort is part of methods to derive information from web documents in order
.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3460969.svg
:target: https://doi.org/10.5281/zenodo.3460969

- Barbaresi, A. *Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction*, Proceedings of the Annual Meeting of the ACL, System Demonstrations, 2021, to appear.
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
- Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
