eval: review code, add guidelines and small benchmark (#542)
* eval: review code, add guidelines and small benchmark

* edit readme

* change file name in CONTRIBUTING.md
adbar committed Apr 4, 2024
1 parent d9d47a7 commit fb3e174
Showing 5 changed files with 478 additions and 81 deletions.
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -20,12 +20,12 @@ Here are some important resources:

## Submitting changes

-Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you've done (read more about [pull requests](http://help.github.com/pull-requests/)).
+Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).

**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)


-A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in developing and enhancing Trafilatura.
+A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.



@@ -34,7 +34,7 @@ A special thanks to all the [contributors](https://github.com/adbar/trafilatura/
Here is how you can run the tests if you wish to correct the errors and further improve the code:

- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
-- Check how it performs on the benchmark in `tests/eval/` by running `tests/comparison.py`
+- Check how it performs on the benchmark in `tests/eval/` by running `tests/comparison_small.py`

See also the [tests Readme](tests/README.rst) for more information on the evaluation.
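
For reference, a minimal sketch of the equivalent calls from a Python session, assuming the commands are issued from the repository root (the paths are illustrative and may need adjusting):

```python
import subprocess
import sys

import pytest

# run a single test module, as with `pytest realworld_tests.py`
pytest.main(["tests/realworld_tests.py"])

# run the small benchmark, as with `python3 comparison_small.py` inside tests/
subprocess.run([sys.executable, "comparison_small.py"], cwd="tests", check=True)
```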

51 changes: 45 additions & 6 deletions tests/README.rst
@@ -1,24 +1,63 @@
Evaluation
==========

-Reproducing the evaluation
---------------------------
+Introduction
+^^^^^^^^^^^^

Focus
-----

The multilingual evaluation features a wide array of websites: news outlets, online magazines, blogs, and government or company pages. Archived versions of the pages are sometimes used to test whether the extraction remains consistent over time.

The benchmark focuses on decisive text parts, mostly at the beginning and the end of the main text, where errors often happen. Other difficult segments throughout the document are chosen to improve the detection of false positives, and segments from particular elements (e.g. quotes or lists) are included to check whether all necessary parts of a document are present in the output.
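
To illustrate, each gold-standard entry pairs a document with segments that must appear in the extract and segments that must not. Below is a hypothetical entry, modeled on the ``EVAL_PAGES`` dictionary and the ``file`` and ``with`` keys visible in ``comparison.py``; the ``without`` key and the exact layout of ``evaldata.py`` are assumptions:

```python
# hypothetical gold-standard entry for illustration only
EVAL_PAGES = {
    "example.com/article.html": {
        "file": "example.com.article.html",
        "with": [
            "First sentence of the main text.",
            "Closing paragraph before the comment section.",
        ],
        "without": [
            "Related articles",
            "Subscribe to the newsletter",
        ],
    },
}
```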


Caveats
-------

This type of evaluation does not probe for duplicate segments, but Trafilatura features an LRU cache for the detection of duplicate text parts.

It is not evaluated whether the extracted segments appear in the right order, although the chosen segments are generally few and far apart.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents. More comprehensive evaluations are available, mostly focusing on English and/or a particular text type.


Running the code
^^^^^^^^^^^^^^^^

The results and a list of comparable benchmarks are available on the `evaluation page of the docs <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_.


Trafilatura evaluation
----------------------

The following allows for comparing changes made to Trafilatura, for example in a new version or pull request:

1. Install Trafilatura
2. Run the script ``comparison_small.py``


Full evaluation
---------------

A comparison with similar software is run periodically. As the packages tend to evolve, the script may not always be up to date and some packages may not be available. If that happens, commenting out the corresponding sections is the most efficient solution. Fixes to the file can be submitted as pull requests.


1. Install the packages specified in ``eval-requirements.txt``
-2. Run the script ``comparison.py``
+2. Run the script ``comparison.py`` (some packages are slow, so this can take a while)
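
The scores reported on the evaluation page are standard precision/recall-style metrics. A minimal sketch of how such figures can be derived from true/false positive and negative counts follows; the exact code in ``comparison.py`` may differ:

```python
def compute_scores(tp, fn, fp, tn):
    # precision: share of reported segments that were expected
    precision = tp / (tp + fp)
    # recall: share of expected segments that were found
    recall = tp / (tp + fn)
    # F-score: harmonic mean of precision and recall
    fscore = 2 * (precision * recall) / (precision + recall)
    # accuracy: share of correct decisions overall
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, fscore, accuracy
```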


Sources
--------
+^^^^^^^

Annotated HTML documents
-^^^^^^^^^^^^^^^^^^^^^^^^
+------------------------

- BBAW collection (multilingual): Adrien Barbaresi, Lukas Kozmus, Lena Klink.
- Polish news: `tsolewski <https://github.com/tsolewski/Text_extraction_comparison_PL>`_.

HTML archives
-^^^^^^^^^^^^^
+-------------

- Additional German news sites: diskursmonitor.de, courtesy of Jan Oliver Rüdiger.

69 changes: 4 additions & 65 deletions tests/comparison.py
@@ -7,11 +7,6 @@
import re
import time

-from lxml import html # etree

-#from lxml.html.clean import Cleaner
-#HTML_CLEANER = Cleaner()

try:
    from cchardet import detect
except ImportError:
@@ -43,7 +38,7 @@
    baseline = None
from evaldata import EVAL_PAGES

-from trafilatura.utils import sanitize
+# from trafilatura.utils import sanitize

## TODO: time, best of 3
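
One way the TODO above could be addressed, sketched under the assumption that a best-of-three wall-clock measurement is wanted; the helper name is illustrative:

```python
import timeit

def best_of_three(func, htmlstring):
    # run the extractor three times on the same document and keep the fastest run,
    # which reduces the influence of system noise on the reported time
    timings = timeit.repeat(lambda: func(htmlstring), repeat=3, number=1)
    return min(timings)
```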

@@ -87,7 +82,7 @@ def load_document_string(filename):
    #if not os.path.isfile(mypath):
    #    mypath = os.path.join(TEST_DIR, 'additional', filename)
    try:
-        with open(mypath, 'r') as inputf:
+        with open(mypath, 'r', encoding="utf-8") as inputf:
            htmlstring = inputf.read()
    # encoding/windows fix for the tests
    except UnicodeDecodeError:
@@ -105,65 +100,9 @@ def load_document_string(filename):
    return htmlstring


-def run_baseline_2(htmlstring):
-    '''run bare text extraction within lxml'''
-    # binary/string as input tweak
-    try:
-        tree = html.fromstring(htmlstring)
-    except ValueError:
-        tree = html.fromstring(htmlstring.encode('utf8'))
-    result = None
-    # try json-ld
-    for elem in tree.xpath('//script[@type="application/ld+json"]'):
-        if elem.text and '"articleBody":' in elem.text:
-            mymatch = re.search(r'"articleBody":"(.+?)","', elem.text)
-            if mymatch:
-                result = mymatch.group(1)
-                result = result.replace('\\"', '"')
-                # result = trim(result)
-                break
-    if result is not None:
-        return result
-    #results = set()
-    resultlist = []
-    # iterate potentially relevant elements
-    for element in tree.iter('blockquote', 'code', 'p', 'pre', 'q'): # 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'
-        #if element.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
-        #    if not element.text or len(element.text) < 20:
-        #        continue
-        #    entry = element.text
-        #else:
-        entry = element.text_content()
-        #if entry not in results and len(entry) > 10:
-        resultlist.append(entry)
-        #results.add(entry)
-    # if nothing has been found
-    #if len(resultlist) < 1:
-    #    for element in tree.iter('b', 'em', 'i', 'strong'):
-    #        entry = element.text_content()
-    #        #if entry not in results: # and len(entry) > 15:
-    #        resultlist.append(entry)
-    #        #results.add(entry)
-    #if len(resultlist) == 0:
-    #    cleaned_tree = HTML_CLEANER.clean_html(tree)
-    #    for element in tree.iter('div'):
-    #        entry = element.text_content()
-    #        #if len(entry) > 15:
-    #        resultlist.append(entry)
-    #        #results.add(entry)
-    #print(len(resultlist))
-    result = '\n'.join(resultlist)
-    # result = sanitize(result)
-    # print(result)
-    return result


def run_baseline(htmlstring):
    '''run bare text extraction within lxml'''
-    if baseline is not None:
-        _, result, _ = baseline(htmlstring)
-        return result
-    result = run_baseline_2(htmlstring)
+    _, result, _ = baseline(htmlstring)
    return result
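
As the simplified run_baseline shows, Trafilatura's baseline() returns a 3-tuple with the extracted text in second position. A hypothetical usage sketch, assuming the module path used in recent versions and a made-up input file:

```python
from trafilatura.baseline import baseline  # import path assumed from recent versions

# hypothetical input file; any HTML string works
with open("tests/eval/example.com.article.html", "r", encoding="utf-8") as inputf:
    htmlstring = inputf.read()

# baseline() returns (body element, text, text length); only the text is compared here
_, text, _ = baseline(htmlstring)
print(text[:200])
```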


@@ -379,7 +318,7 @@ def evaluate_result(result, item):
    #elif type(result) is not str:
    #    print('not str', item['file'])
    # examine
-    if result is not None and type(result) is str:
+    if result is not None and isinstance(result, str):
        # expected output
        for to_include in item['with']:
            if to_include in result:
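
For context, a hedged sketch of the counting logic this check feeds into: expected segments ('with') count as true positives when found and false negatives when missed, while unwanted segments count as false positives or true negatives. The actual evaluate_result() may differ, and the 'without' key name is an assumption:

```python
def count_results(result, item):
    # hypothetical re-implementation for illustration, not the original function
    tp = fn = fp = tn = 0
    valid = result is not None and isinstance(result, str)
    for to_include in item["with"]:
        if valid and to_include in result:
            tp += 1
        else:
            fn += 1
    for to_exclude in item["without"]:  # the 'without' key is an assumption
        if valid and to_exclude in result:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn
```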