ERRANT v2.1.0

chrisjbryant · Jan 9, 2020 · commit e1e6066 · 1 parent 9901a97

Showing 11 changed files with 452 additions and 114 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,19 @@
 
 This log describes all the significant changes made to ERRANT since its release.
 
+## v2.1.0 (09-01-20)
+
+1. The character level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of python's native `difflib.SequenceMatcher`. This makes ERRANT 3x faster!
+
+2. Various minor updates:  
+* Updated the English wordlist.
+* Fixed a broken rule for classifying contraction errors.
+* Changed a condition in the calculation of transposition errors to be more intuitive.
+* Partially updated the ERRANT POS tag map to match the updated [Universal POS tag map](https://universaldependencies.org/tagset-conversion/en-penn-uposf.html). Specifically, EX now maps to PRON rather than ADV, LS maps to X rather than PUNCT, and CONJ has been renamed CCONJ. I did not change the mapping of RP from PART to ADP yet because this breaks several rules involving phrasal verbs.
+* Added an `errant.__version__` attribute.
+* Added a warning about using ERRANT with spaCy 2.
+* Tidied some code in the classifier.
+
 ## v2.0.0 (10-12-19)
 
 1. ERRANT has been significantly refactored to accommodate a new API (see README). It should now also be much easier to extend to other languages.

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# ERRANT v2.0.0
+# ERRANT v2.1.0
 
 This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:
 
@@ -37,17 +37,21 @@ source errant_env/bin/activate
 pip3 install errant
 python3 -m spacy download en
 ```
-This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.  
+This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.  
 
-**Note: ERRANT does not support spaCy 2 at this time**. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.  
-
-### BEA-2019 Shared Task
+#### BEA-2019 Shared Task
 
 ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores.  
 ```
 pip3 install errant==2.0.0
 ```
 
+#### ERRANT and spaCy 2
+
+ERRANT was originally designed to work with spaCy v1.9.0 and so only officially supports this version. We nevertheless tested ERRANT v2.1.0 with spaCy v2.2.3 and found it to be **over 4x slower and ~2% less accurate**. 
+
+This is mainly because spaCy 2 uses a neural system to trade speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), but also because some Universal POS tag mappings changed, and so certain ERRANT rules no longer worked as intended. Although we could offset the accuracy loss by modifying ERRANT rules for the new POS mappings, there is nothing we can do about the significant speed loss, and so do not recommend spaCy 2 with ERRANT at this time. 
+
 ## Source Install
 
 If you prefer to install ERRANT from source, you can instead run the following commands:
@@ -98,7 +102,7 @@ All these scripts also have additional advanced command line options which can b
 
 #### Runtime
 
-In terms of speed, ERRANT processes ~155 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.  
+In terms of speed, ERRANT processes ~500 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.  
 
 ## API
 
@@ -226,6 +230,10 @@ The error type string.
 `edit`.**to_m2**(id=0)  
 Format the edit for an output M2 file. `id` is the annotator id.	
 
+## Development for Other Languages
+
+If you want to develop ERRANT for other languages, you should mimic the `errant/en` directory structure. For example, ERRANT for French should import a merger from `errant.fr.merger` and a classifier from `errant.fr.classifier` that respectively have equivalent `get_rule_edits` and `classify` methods. You will also need to add `'fr'` to the list of supported languages in `errant/__init__.py`.
+
 # Contact
 
 If you have any questions, suggestions or bug reports, you can contact the authors at:  

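Note on the "Development for Other Languages" section: the module lookup it describes can be sketched as below. This is a minimal sketch, not the code in `errant/__init__.py`. The helper names `language_module_names` and `load_language_modules` and the `SUPPORTED_LANGS` list are hypothetical; only the `errant.<lang>.merger` and `errant.<lang>.classifier` naming pattern comes from the README text and the `import_module` call in the `errant/__init__.py` diff.

```python
from importlib import import_module

# Hypothetical stand-in for the supported-language list that the README
# says lives in errant/__init__.py; 'fr' is the README's own example.
SUPPORTED_LANGS = ["en", "fr"]

def language_module_names(lang, supported=SUPPORTED_LANGS):
    """Return the dotted names of the merger and classifier modules for lang."""
    if lang not in supported:
        raise ValueError("Unsupported language: %r" % lang)
    return ("errant.%s.merger" % lang, "errant.%s.classifier" % lang)

def load_language_modules(lang):
    """Import both modules by name (assumes an ERRANT install that provides them)."""
    merger_name, classifier_name = language_module_names(lang)
    return import_module(merger_name), import_module(classifier_name)
```

Under this naming scheme, a new language only needs its own `merger` and `classifier` modules under `errant/<lang>/` and an entry in the supported list.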
diff --git a/errant/__init__.py b/errant/__init__.py
@@ -1,7 +1,11 @@
 from importlib import import_module
+import logging
 import spacy
 from errant.annotator import Annotator
 
+# ERRANT version
+__version__ = '2.1.0'
+
 # Load an ERRANT Annotator object for a given language
 def load(lang, nlp=None):
     # Make sure the language is supported
@@ -11,6 +15,10 @@ def load(lang, nlp=None):
 
     # Load spacy
     nlp = nlp or spacy.load(lang, disable=["ner"])
+    # Warning for spacy 2
+    if spacy.__version__[0] == "2":
+        logging.warning("ERRANT is 4x slower and 2% less accurate with spaCy 2. "
+            "We strongly recommend spaCy 1.9.0!")
 
     # Load language edit merger
     merger = import_module("errant.%s.merger" % lang)

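Note on the spaCy 2 warning: the check above compares the first character of `spacy.__version__`. A standalone sketch of the same idea, written so it can be tried without spaCy installed, might look like the following. The function name `warn_if_spacy2` is hypothetical and not part of ERRANT. It also compares the whole major-version component rather than the first character, which is a deliberate variation on the committed code.

```python
import logging

def warn_if_spacy2(spacy_version):
    """Log a warning and return True if the version string has major version 2.

    Hypothetical helper: the commit performs this check inline in errant.load().
    """
    # Compare the full major component, so a version such as "20.0" does not match.
    major = spacy_version.split(".")[0]
    if major == "2":
        logging.warning("ERRANT is 4x slower and 2% less accurate with spaCy 2. "
                        "We strongly recommend spaCy 1.9.0!")
        return True
    return False
```

Passing the version string in as an argument keeps the check testable on its own, for example `warn_if_spacy2("2.2.3")` versus `warn_if_spacy2("1.9.0")`.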
diff --git a/errant/alignment.py b/errant/alignment.py
@@ -1,5 +1,5 @@
-from difflib import SequenceMatcher
 from itertools import groupby
+import Levenshtein
 import spacy.parts_of_speech as POS
 from errant.edit import Edit
 
@@ -62,7 +62,7 @@ def align(self, lev):
                         # Traverse the diagonal while there is not a Match.
                         k = 1
                         while i-k >= 0 and j-k >= 0 and \
-                                cost_matrix[i-k+1][j-k+1]-cost_matrix[i-k][j-k] > 0:
+                                cost_matrix[i-k+1][j-k+1] != cost_matrix[i-k][j-k]:
                             if sorted(o_low[i-k:i+1]) == sorted(c_low[j-k:j+1]):
                                 trans_cost = cost_matrix[i-k][j-k] + k
                                 break
@@ -94,7 +94,7 @@ def get_sub_cost(self, o, c):
         elif o.pos in self._open_pos and c.pos in self._open_pos: pos_cost = 0.25
         else: pos_cost = 0.5
         # Char cost
-        char_cost = 1-SequenceMatcher(None, o.text, c.text).ratio()
+        char_cost = 1-Levenshtein.ratio(o.text, c.text)
         # Combine the costs
         return lemma_cost + pos_cost + char_cost
 

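Note on the character cost change: the two variants of `char_cost` can be placed side by side as in the sketch below. The function names `char_cost_difflib` and `char_cost_levenshtein` are hypothetical wrappers for illustration. The bodies mirror the removed and added lines in `get_sub_cost`, except that they take plain strings instead of spaCy tokens. The Levenshtein variant assumes the third-party python-Levenshtein package is installed and is skipped otherwise.

```python
from difflib import SequenceMatcher

try:
    import Levenshtein  # third-party: pip3 install python-Levenshtein
except ImportError:
    Levenshtein = None  # the sketch still runs with the difflib variant only

def char_cost_difflib(o_text, c_text):
    """Character cost as computed before this commit (stdlib difflib)."""
    return 1 - SequenceMatcher(None, o_text, c_text).ratio()

def char_cost_levenshtein(o_text, c_text):
    """Character cost as computed after this commit (python-Levenshtein)."""
    if Levenshtein is None:
        raise RuntimeError("python-Levenshtein is not installed")
    return 1 - Levenshtein.ratio(o_text, c_text)

if __name__ == "__main__":
    for o, c in [("cat", "cat"), ("walk", "walked"), ("abc", "xyz")]:
        line = "%s -> %s: difflib=%.3f" % (o, c, char_cost_difflib(o, c))
        if Levenshtein is not None:
            line += " levenshtein=%.3f" % char_cost_levenshtein(o, c)
        print(line)
```

Both functions return 0 for identical strings and values approaching 1 for strings with nothing in common. The two libraries compute their ratios differently, so intermediate costs are not guaranteed to match, which is consistent with the README's caveat that newer versions may produce slightly different scores.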
diff --git a/errant/commands/parallel_to_m2.py b/errant/commands/parallel_to_m2.py
@@ -45,6 +45,9 @@ def main():
                         out_m2.write(edit.to_m2(cor_id)+"\n")
             # Write a newline when we have processed all corrections for each line
             out_m2.write("\n")
+
+#    pr.disable()
+#    pr.print_stats(sort="time")
 
 # Parse command line args
 def parse_args():