Merge pull request #240 from chrislit/0.4.1
0.4.1
This isn't quite EVERYTHING that will go in 0.4.1, but a release is coming soon. Also need to get CI working across platforms.
chrislit committed Jan 6, 2020
2 parents 643512e + a604561 commit 2b6b3ed
Showing 103 changed files with 8,894 additions and 4,082 deletions.
108 changes: 108 additions & 0 deletions FAQ.rst
@@ -0,0 +1,108 @@
FAQ
===


Why is the library licensed under GPL3+? Can you change the license?
--------------------------------------------------------------------

GPL3 is the only license compatible with all of the various parts of
Abydos that have been ported to Python from other languages. For example,
the Beider-Morse Phonetic Matching algorithm implementation included in
Abydos was ported from their reference implementation in PHP, which is
itself licensed under GPL3.

Accordingly, it's not possible to change to a different license without
removing parts of the library. However, if you have a need for a specific
part of the library and can't use GPL3+ code, contact us and we may be able
to provide it separately or can give guidance on its underlying licensing
status.

What is the purpose of this library?
------------------------------------

Abydos is intended to facilitate any manner of string transformation and
comparison that might be useful for string matching or record linkage. The two
most significant parts of the library are string distance/similarity
measures and phonetic algorithms/string fingerprint algorithms, but a large
collection of tokenizers, corpus classes, compression algorithms, &
phonetics functions support these and afford greater customization.
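
For instance, pairing one measure from each of those two major parts (a
minimal illustration; ``0.6`` reflects a Levenshtein distance of 3
normalized by the longer string's length, and ``C623`` is the standard
American Soundex code for Christopher)::

    >>> from abydos.distance import Levenshtein
    >>> from abydos.phonetic import Soundex
    >>> Levenshtein().dist('Niall', 'Neil')
    0.6
    >>> Soundex().encode('Christopher')
    'C623'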

Can you add this new feature?
-----------------------------

Maybe. Open an issue at https://github.com/chrislit/abydos/issues and
propose your new feature.

Additional string distance/similarity measures,
phonetic algorithms, string fingerprint algorithms, and string tokenizers
will certainly be added if possible -- but it's helpful to point them
out since we may not be aware of them.

Can I contribute to the project?
--------------------------------

Absolutely. You can take on an unclaimed issue, report bugs, add new
classes, or whatever piques your interest. You are welcome to open an
issue at https://github.com/chrislit/abydos/issues proposing what you'd
like to work on, or you can submit a pull request if you have something
ready to contribute to the repository.

Will you add Metaphone 3?
-------------------------

No. Although Lawrence Philips (author of Metaphone, Double Metaphone, and
Metaphone 3) released Metaphone 3 version 2.1.3 under the BSD 3-clause
license as part of Google Refine, which became OpenRefine
(https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/Metaphone3.java),
he doesn't want that code used for ports to other languages or used in any
way outside of OpenRefine. In accordance with his wishes, no one has
released Metaphone 3 ports to other languages or included it in other
libraries.

Why have you included algorithm X when it is already a part of NLTK/SciPy/...?
------------------------------------------------------------------------------

Abydos is a collection of algorithms with common class & function
interfaces and options. So, while NLTK has Levenshtein & Jaccard string
similarity measures, they don't allow for tunable edit costs or for using
the tokenizer of your choice.
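
For example, Abydos's versions accept both (a sketch; the ``cost`` tuple
gives insertion, deletion, substitution, & transposition costs, and any
tokenizer from ``abydos.tokenizer`` can be swapped in)::

    >>> from abydos.distance import Levenshtein, Jaccard
    >>> from abydos.tokenizer import QGrams
    >>> # doubling the substitution cost makes 'cat' -> 'hat' cost 2,
    >>> # whether by one substitution or by a delete plus an insert
    >>> Levenshtein(cost=(1, 1, 2, 2)).dist_abs('cat', 'hat')
    2
    >>> # score Jaccard over 3-grams instead of the default 2-grams
    >>> sim = Jaccard(tokenizer=QGrams(qval=3)).sim('nelson', 'neilsen')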

Are there similar projects for languages other than Python?
-----------------------------------------------------------

Yes, there are libraries such as:

- Talisman_ for JavaScript
- Phonics_ for R (phonetic algorithms)
- stringmetric_ for Scala

.. _Talisman: https://github.com/Yomguithereal/talisman
.. _Phonics: https://github.com/howardjp/phonics
.. _stringmetric: https://github.com/rockymadden/stringmetric

What is the process for adding a new class to the library?
----------------------------------------------------------

Adding a new class roughly follows these steps:

- Discover that a new (unimplemented) measure/algorithm/method exists
- Locate the original source of the algorithm (a journal article, a
  reference implementation, etc.) and save a reference to it in
  docs/abydos.bib.

  - If the original source cannot be located for reference, use an
    adequate secondary source and add its reference info to
    docs/abydos.bib.

- Implement the class based on its description/reference implementation
  (a minimal, hypothetical sketch follows this list).
- Create a test class and add all examples and test cases from the
  original source. Add other reliable test cases from other sources, if
  they are available.
- Ensure that the class passes all test cases.
- Add test cases, as necessary, until test coverage reaches 100%, or as
  close to 100% as possible.
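
For the implementation and test steps, a minimal standalone sketch (the
measure and its values here are hypothetical, purely for illustration;
real classes derive from the library's internal distance base class, and
their test cases live under tests/)::

    import unittest


    class ExampleSim:
        """Example similarity: the proportion of aligned positions whose
        characters match (a hypothetical measure, not one from Abydos)."""

        def sim(self, src, tar):
            """Return similarity of src and tar, normalized to [0, 1]."""
            if not src and not tar:
                return 1.0
            matches = sum(a == b for a, b in zip(src, tar))
            return matches / max(len(src), len(tar))

        def dist(self, src, tar):
            """Return distance as the complement of similarity."""
            return 1.0 - self.sim(src, tar)


    class ExampleSimTestCases(unittest.TestCase):
        """Test cases drawn from the (hypothetical) original source."""

        def test_sim(self):
            cmp = ExampleSim()
            self.assertEqual(cmp.sim('', ''), 1.0)
            self.assertAlmostEqual(cmp.sim('cat', 'hat'), 2 / 3)
            self.assertAlmostEqual(cmp.dist('cat', 'hat'), 1 / 3)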

Are these really Frequently Asked Questions?
--------------------------------------------

No. Most of these questions have never been explicitly asked.
11 changes: 11 additions & 0 deletions HISTORY.rst
@@ -21,6 +21,17 @@ Changes:
  - LIG3 similarity
  - Discounted Levenshtein
  - Relaxed Hamming
  - String subsequence kernel (SSK) similarity
  - Phonetic edit distance
  - Henderson-Heron dissimilarity
  - Raup-Crick similarity
  - Millar's binomial deviance dissimilarity
  - Morisita similarity
  - Horn-Morisita similarity
  - Clark's coefficient of divergence
  - Chao's Jaccard similarity
  - Chao's Dice similarity
  - Cao's CY similarity (CYs) and dissimilarity (CYd)
- Added the following fingerprint classes:
  - Taft's Consonant coding
  - Taft's Extract - letter list
1 change: 0 additions & 1 deletion Pipfile
@@ -7,7 +7,6 @@ verify_ssl = true
tox = "*"
nose = "*"
coverage = "*"
scipy = "*"
nltk = "*"
syllabipy = "*"

33 changes: 33 additions & 0 deletions abydos/distance/__init__.py
@@ -38,6 +38,7 @@
- FlexMetric distance (:py:class:`.FlexMetric`)
- BI-SIM similarity (:py:class:`.BISIM`)
- Discounted Levenshtein distance (:py:class:`.DiscountedLevenshtein`)
- Phonetic edit distance (:py:class:`.PhoneticEditDistance`)
Hamming distance (:py:class:`.Hamming`), Relaxed Hamming distance
(:py:class:`.RelaxedHamming`), and the closely related Modified
@@ -91,8 +92,12 @@
- Bennet's S correlation (:py:class:`.Bennet`)
- Braun-Blanquet similarity (:py:class:`.BraunBlanquet`)
- Canberra distance (:py:class:`.Canberra`)
- Cao similarity (:py:class:`.Cao`)
- Chao's Dice similarity (:py:class:`.ChaoDice`)
- Chao's Jaccard similarity (:py:class:`.ChaoJaccard`)
- Chebyshev distance (:py:class:`.Chebyshev`)
- Chord distance (:py:class:`.Chord`)
- Clark distance (:py:class:`.Clark`)
- Clement similarity (:py:class:`.Clement`)
- Cohen's Kappa similarity (:py:class:`.CohenKappa`)
- Cole correlation (:py:class:`.Cole`)
@@ -139,6 +144,8 @@
- Hassanat distance (:py:class:`.Hassanat`)
- Hawkins & Dotson similarity (:py:class:`.HawkinsDotson`)
- Hellinger distance (:py:class:`.Hellinger`)
- Henderson-Heron similarity (:py:class:`.HendersonHeron`)
- Horn-Morisita similarity (:py:class:`.HornMorisita`)
- Hurlbert correlation (:py:class:`.Hurlbert`)
- Jaccard similarity (:py:class:`.Jaccard`) &
Tanimoto coefficient (:py:meth:`.Jaccard.tanimoto_coeff`)
@@ -167,6 +174,7 @@
- Lorentzian distance (:py:class:`.Lorentzian`)
- Maarel correlation (:py:class:`.Maarel`)
- Manhattan distance (:py:class:`.Manhattan`)
- Morisita similarity (:py:class:`.Morisita`)
- marking distance (:py:class:`.Marking`)
- marking metric (:py:class:`.MarkingMetric`)
- MASI similarity (:py:class:`.MASI`)
@@ -177,6 +185,7 @@
- mean squared contingency correlation (:py:class:`.MSContingency`)
- Michael similarity (:py:class:`.Michael`)
- Michelet similarity (:py:class:`.Michelet`)
- Millar distance (:py:class:`.Millar`)
- Minkowski distance (:py:class:`.Minkowski`)
- Mountford similarity (:py:class:`.Mountford`)
- Mutual Information similarity (:py:class:`.MutualInformation`)
@@ -189,6 +198,7 @@
- Pearson's Phi correlation (:py:class:`.PearsonPhi`)
- Peirce correlation (:py:class:`.Peirce`)
- q-gram distance (:py:class:`.QGram`)
- Raup-Crick similarity (:py:class:`.RaupCrick`)
- Rogers & Tanimoto similarity (:py:class:`.RogersTanimoto`)
- Rogot & Goldberg similarity (:py:class:`.RogotGoldberg`)
- Russell & Rao similarity (:py:class:`.RussellRao`)
@@ -333,6 +343,7 @@
- Guth (:py:class:`.Guth`)
- Victorian Panel Study (:py:class:`.VPS`)
- LIG3 (:py:class:`.LIG3`)
- String subsequence kernel (SSK) (:py:class:`.SSK`)
Most of the distance and similarity measures have ``sim`` and ``dist`` methods,
which return a measure that is normalized to the range :math:`[0, 1]`. The
@@ -399,8 +410,12 @@
from ._brainerd_robinson import BrainerdRobinson
from ._braun_blanquet import BraunBlanquet
from ._canberra import Canberra
from ._cao import Cao
from ._chao_dice import ChaoDice
from ._chao_jaccard import ChaoJaccard
from ._chebyshev import Chebyshev, chebyshev
from ._chord import Chord
from ._clark import Clark
from ._clement import Clement
from ._cohen_kappa import CohenKappa
from ._cole import Cole
@@ -468,7 +483,9 @@
from ._hassanat import Hassanat
from ._hawkins_dotson import HawkinsDotson
from ._hellinger import Hellinger
from ._henderson_heron import HendersonHeron
from ._higuera_mico import HigueraMico
from ._horn_morisita import HornMorisita
from ._hurlbert import Hurlbert
from ._ident import Ident, dist_ident, sim_ident
from ._inclusion import Inclusion
@@ -524,10 +541,12 @@
from ._mcewen_michael import McEwenMichael
from ._meta_levenshtein import MetaLevenshtein
from ._michelet import Michelet
from ._millar import Millar
from ._minhash import MinHash
from ._minkowski import Minkowski, dist_minkowski, minkowski, sim_minkowski
from ._mlipns import MLIPNS, dist_mlipns, sim_mlipns
from ._monge_elkan import MongeElkan, dist_monge_elkan, sim_monge_elkan
from ._morisita import Morisita
from ._mountford import Mountford
from ._mra import MRA, dist_mra, mra_compare, sim_mra
from ._ms_contingency import MSContingency
@@ -551,6 +570,7 @@
from ._pearson_phi import PearsonPhi
from ._peirce import Peirce
from ._phonetic_distance import PhoneticDistance
from ._phonetic_edit_distance import PhoneticEditDistance
from ._positional_q_gram_dice import PositionalQGramDice
from ._positional_q_gram_jaccard import PositionalQGramJaccard
from ._positional_q_gram_overlap import PositionalQGramOverlap
@@ -564,6 +584,7 @@
dist_ratcliff_obershelp,
sim_ratcliff_obershelp,
)
from ._raup_crick import RaupCrick
from ._rees_levenshtein import ReesLevenshtein
from ._relaxed_hamming import RelaxedHamming
from ._roberts import Roberts
@@ -593,6 +614,7 @@
from ._sokal_sneath_iv import SokalSneathIV
from ._sokal_sneath_v import SokalSneathV
from ._sorgenfrei import Sorgenfrei
from ._ssk import SSK
from ._steffensen import Steffensen
from ._stiles import Stiles
from ._strcmp95 import Strcmp95, dist_strcmp95, sim_strcmp95
@@ -670,6 +692,7 @@
'FlexMetric',
'BISIM',
'DiscountedLevenshtein',
'PhoneticEditDistance',
'Hamming',
'hamming',
'dist_hamming',
@@ -715,9 +738,13 @@
'Bennet',
'BraunBlanquet',
'Canberra',
'Cao',
'ChaoDice',
'ChaoJaccard',
'Chebyshev',
'chebyshev',
'Chord',
'Clark',
'Clement',
'CohenKappa',
'Cole',
@@ -771,6 +798,8 @@
'Hassanat',
'HawkinsDotson',
'Hellinger',
'HendersonHeron',
'HornMorisita',
'Hurlbert',
'Jaccard',
'dist_jaccard',
@@ -800,11 +829,13 @@
'KulczynskiII',
'Lorentzian',
'Maarel',
'Morisita',
'Manhattan',
'manhattan',
'dist_manhattan',
'sim_manhattan',
'Michelet',
'Millar',
'Minkowski',
'minkowski',
'dist_minkowski',
@@ -828,6 +859,7 @@
'PearsonPhi',
'Peirce',
'QGram',
'RaupCrick',
'ReesLevenshtein',
'RogersTanimoto',
'RogotGoldberg',
@@ -1001,6 +1033,7 @@
'Guth',
'VPS',
'LIG3',
'SSK',
]


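As the module docstring notes, the distance and similarity measures share
``sim``/``dist`` methods normalized to [0, 1], and the classes added here
follow the same interface. A quick sketch with two of the additions (no
specific values asserted, since they depend on the implementations):

    >>> from abydos.distance import PhoneticEditDistance, SSK
    >>> 0.0 <= PhoneticEditDistance().dist('Smith', 'Smyth') <= 1.0
    True
    >>> 0.0 <= SSK().sim('cat', 'hat') <= 1.0
    True
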
51 changes: 45 additions & 6 deletions abydos/distance/_aline.py
@@ -1241,7 +1241,44 @@ def __init__(
        self._phones = self.phones_kondrak
        self._normalizer = normalizer

    def alignment(self, src, tar, score_only=False):
    def alignment(self, src, tar):
        """Return the top ALINE alignment of two strings.

        The `top` ALINE alignment is the first alignment with the best score.
        The purpose of this function is to have a single tuple as a return
        value.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison

        Returns
        -------
        tuple(float, str, str)
            ALINE alignment and its score

        Examples
        --------
        >>> cmp = ALINE()
        >>> cmp.alignment('cat', 'hat')
        (50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')
        >>> cmp.alignment('niall', 'neil')
        (90.0, '‖ n i a ll ‖', '‖ n e i l ‖')
        >>> cmp.alignment('aluminum', 'catalan')
        (81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')
        >>> cmp.alignment('atcg', 'tagc')
        (65.0, '‖ a t c ‖ g', 't ‖ a g c ‖')

        .. versionadded:: 0.4.1

        """
        return self.alignments(src, tar)[0]

    def alignments(self, src, tar, score_only=False):
        """Return the ALINE alignments of two strings.

        Parameters
Expand All @@ -1261,18 +1298,20 @@ def alignment(self, src, tar, score_only=False):
        Examples
        --------
        >>> cmp = ALINE()
        >>> cmp.alignment('cat', 'hat')
        >>> cmp.alignments('cat', 'hat')
        [(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')]
        >>> cmp.alignment('niall', 'neil')
        >>> cmp.alignments('niall', 'neil')
        [(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')]
        >>> cmp.alignment('aluminum', 'catalan')
        >>> cmp.alignments('aluminum', 'catalan')
        [(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')]
        >>> cmp.alignment('atcg', 'tagc')
        >>> cmp.alignments('atcg', 'tagc')
        [(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖'), (65.0, 'a ‖ tc - g ‖',
        '‖ t a g ‖ c')]

        .. versionadded:: 0.4.0
        .. versionchanged:: 0.4.1
            Renamed from .alignment to .alignments

        """

@@ -1615,7 +1654,7 @@ def sim_score(self, src, tar):
        """
        if src == '' and tar == '':
            return 1.0
        return self.alignment(src, tar, score_only=True)
        return self.alignments(src, tar, score_only=True)

    def sim(self, src, tar):
        """Return the normalized ALINE similarity of two strings.
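
Per the new docstring, ``alignment`` now just returns the first
(top-scoring) entry of ``alignments``, so the two methods relate as in
this quick sketch (not code from the commit):

    >>> from abydos.distance import ALINE
    >>> cmp = ALINE()
    >>> cmp.alignment('cat', 'hat') == cmp.alignments('cat', 'hat')[0]
    True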