Merge pull request #240 from chrislit/0.4.1
0.4.1
This isn't quite EVERYTHING that will go in 0.4.1, but a release is coming soon. Also need to get CI working across platforms.
chrislit committed Jan 6, 2020
2 parents 643512e + a604561 commit 2b6b3ed
Showing 103 changed files with 8,894 additions and 4,082 deletions.
108 changes: 108 additions & 0 deletions FAQ.rst
@@ -0,0 +1,108 @@
FAQ
===


Why is the library licensed under GPL3+? Can you change the license?
--------------------------------------------------------------------

GPL3 is the only license compatible with all of the various parts of
Abydos that have been ported to Python from other languages. For example,
the Beider-Morse Phonetic Matching algorithm implementation included in
Abydos was ported from their reference implementation in PHP, which is
itself licensed under GPL3.

Accordingly, it's not possible to change to a different license without
removing parts of the library. However, if you have a need for a specific
part of the library and can't use GPL3+ code, contact us and we may be able
to provide it separately or can give guidance on its underlying licensing
status.

What is the purpose of this library?
------------------------------------

Abydos is intended to facilitate any manner of string transformation and
comparison that might be useful for string matching or record linkage. The two
most significant parts of the library are string distance/similarity
measures and phonetic algorithms/string fingerprint algorithms, but a large
collection of tokenizers, corpus classes, compression algorithms, &
phonetics functions support these and afford greater customization.
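
For instance, pairing one measure from each of those two major parts (a
minimal illustration; ``0.6`` reflects a Levenshtein distance of 3
normalized by the longer string's length, and ``C623`` is the standard
American Soundex code for Christopher)::

    >>> from abydos.distance import Levenshtein
    >>> from abydos.phonetic import Soundex
    >>> Levenshtein().dist('Niall', 'Neil')
    0.6
    >>> Soundex().encode('Christopher')
    'C623'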

Can you add this new feature?
-----------------------------

Maybe. Open an issue at https://github.com/chrislit/abydos/issues and
propose your new feature.

Additional string distance/similarity measures,
phonetic algorithms, string fingerprint algorithms, and string tokenizers
will certainly be added if possible -- but it's helpful to point them
out since we may not be aware of them.

Can I contribute to the project?
--------------------------------

Absolutely. You can take on an unclaimed issue, report bugs, add new
classes, or whatever piques your interest. You are welcome to open an
issue at https://github.com/chrislit/abydos/issues proposing what you'd
like to work on, or you can submit a pull request if you have something
ready to contribute to the repository.

Will you add Metaphone 3?
-------------------------

No. Although Lawrence Philips (author of Metaphone, Double Metaphone, and
Metaphone 3) released Metaphone 3 version 2.1.3 under the BSD 3-clause
license as part of Google Refine, which became OpenRefine
(https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/Metaphone3.java),
he doesn't want that code used for ports to other languages or used in any
way outside of OpenRefine. In accordance with his wishes, no one has
released Metaphone 3 ports to other languages or included it in other
libraries.

Why have you included algorithm X when it is already a part of NLTK/SciPy/...?
------------------------------------------------------------------------------

Abydos is a collection of algorithms with common class & function
interfaces and options. So, while NLTK has Levenshtein & Jaccard string
similarity measures, they don't allow for tunable edit costs or for using
the tokenizer of your choice.
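
For example, Abydos's versions accept both (a sketch; the ``cost`` tuple
gives insertion, deletion, substitution, & transposition costs, and any
tokenizer from ``abydos.tokenizer`` can be swapped in)::

    >>> from abydos.distance import Levenshtein, Jaccard
    >>> from abydos.tokenizer import QGrams
    >>> # doubling the substitution cost makes 'cat' -> 'hat' cost 2,
    >>> # whether by one substitution or by a delete plus an insert
    >>> Levenshtein(cost=(1, 1, 2, 2)).dist_abs('cat', 'hat')
    2
    >>> # score Jaccard over 3-grams instead of the default 2-grams
    >>> sim = Jaccard(tokenizer=QGrams(qval=3)).sim('nelson', 'neilsen')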

Are there similar projects for languages other than Python?
-----------------------------------------------------------

Yes, there are libraries such as:

- Talisman_ for JavaScript
- Phonics_ for R (phonetic algorithms)
- stringmetric_ for Scala

.. _Talisman: https://github.com/Yomguithereal/talisman
.. _Phonics: https://github.com/howardjp/phonics
.. _stringmetric: https://github.com/rockymadden/stringmetric

What is the process for adding a new class to the library?
----------------------------------------------------------

Adding a new class roughly follows these steps:

- Discover that a new (unimplemented) measure/algorithm/method exists
- Locate the original source of the algorithm (a journal article, a
  reference implementation, etc.) and save a reference to it in
  docs/abydos.bib.

  - If the original source cannot be located for reference, use an
    adequate secondary source and add its reference info to
    docs/abydos.bib.

- Implement the class based on its description/reference implementation
  (a minimal, hypothetical sketch follows this list).
- Create a test class and add all examples and test cases from the
  original source. Add other reliable test cases from other sources, if
  they are available.
- Ensure that the class passes all test cases.
- Add test cases, as necessary, until test coverage reaches 100%, or as
  close to 100% as possible.
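
For the implementation and test steps, a minimal standalone sketch (the
measure and its values here are hypothetical, purely for illustration;
real classes derive from the library's internal distance base class, and
their test cases live under tests/)::

    import unittest


    class ExampleSim:
        """Example similarity: the proportion of aligned positions whose
        characters match (a hypothetical measure, not one from Abydos)."""

        def sim(self, src, tar):
            """Return similarity of src and tar, normalized to [0, 1]."""
            if not src and not tar:
                return 1.0
            matches = sum(a == b for a, b in zip(src, tar))
            return matches / max(len(src), len(tar))

        def dist(self, src, tar):
            """Return distance as the complement of similarity."""
            return 1.0 - self.sim(src, tar)


    class ExampleSimTestCases(unittest.TestCase):
        """Test cases drawn from the (hypothetical) original source."""

        def test_sim(self):
            cmp = ExampleSim()
            self.assertEqual(cmp.sim('', ''), 1.0)
            self.assertAlmostEqual(cmp.sim('cat', 'hat'), 2 / 3)
            self.assertAlmostEqual(cmp.dist('cat', 'hat'), 1 / 3)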

Are these really Frequently Asked Questions?
--------------------------------------------

No. Most of these questions have never been explicitly asked.
11 changes: 11 additions & 0 deletions HISTORY.rst
@@ -21,6 +21,17 @@ Changes:
  - LIG3 similarity
  - Discounted Levenshtein
  - Relaxed Hamming
  - String subsequence kernel (SSK) similarity
  - Phonetic edit distance
  - Henderson-Heron dissimilarity
  - Raup-Crick similarity
  - Millar's binomial deviance dissimilarity
  - Morisita similarity
  - Horn-Morisita similarity
  - Clark's coefficient of divergence
  - Chao's Jaccard similarity
  - Chao's Dice similarity
  - Cao's CY similarity (CYs) and dissimilarity (CYd)
- Added the following fingerprint classes:
  - Taft's Consonant coding
  - Taft's Extract - letter list
1 change: 0 additions & 1 deletion Pipfile
@@ -7,7 +7,6 @@ verify_ssl = true
tox = "*"
nose = "*"
coverage = "*"
scipy = "*"
nltk = "*"
syllabipy = "*"

33 changes: 33 additions & 0 deletions abydos/distance/__init__.py
@@ -38,6 +38,7 @@
- FlexMetric distance (:py:class:`.FlexMetric`)
- BI-SIM similarity (:py:class:`.BISIM`)
- Discounted Levenshtein distance (:py:class:`.DiscountedLevenshtein`)
- Phonetic edit distance (:py:class:`.PhoneticEditDistance`)
Hamming distance (:py:class:`.Hamming`), Relaxed Hamming distance
(:py:class:`.RelaxedHamming`), and the closely related Modified
@@ -91,8 +92,12 @@
- Bennet's S correlation (:py:class:`.Bennet`)
- Braun-Blanquet similarity (:py:class:`.BraunBlanquet`)
- Canberra distance (:py:class:`.Canberra`)
- Cao similarity (:py:class:`.Cao`)
- Chao's Dice similarity (:py:class:`.ChaoDice`)
- Chao's Jaccard similarity (:py:class:`.ChaoJaccard`)
- Chebyshev distance (:py:class:`.Chebyshev`)
- Chord distance (:py:class:`.Chord`)
- Clark distance (:py:class:`.Clark`)
- Clement similarity (:py:class:`.Clement`)
- Cohen's Kappa similarity (:py:class:`.CohenKappa`)
- Cole correlation (:py:class:`.Cole`)
@@ -139,6 +144,8 @@
- Hassanat distance (:py:class:`.Hassanat`)
- Hawkins & Dotson similarity (:py:class:`.HawkinsDotson`)
- Hellinger distance (:py:class:`.Hellinger`)
- Henderson-Heron similarity (:py:class:`.HendersonHeron`)
- Horn-Morisita similarity (:py:class:`.HornMorisita`)
- Hurlbert correlation (:py:class:`.Hurlbert`)
- Jaccard similarity (:py:class:`.Jaccard`) &
Tanimoto coefficient (:py:meth:`.Jaccard.tanimoto_coeff`)
@@ -167,6 +174,7 @@
- Lorentzian distance (:py:class:`.Lorentzian`)
- Maarel correlation (:py:class:`.Maarel`)
- Manhattan distance (:py:class:`.Manhattan`)
- Morisita similarity (:py:class:`.Morisita`)
- marking distance (:py:class:`.Marking`)
- marking metric (:py:class:`.MarkingMetric`)
- MASI similarity (:py:class:`.MASI`)
@@ -177,6 +185,7 @@
- mean squared contingency correlation (:py:class:`.MSContingency`)
- Michael similarity (:py:class:`.Michael`)
- Michelet similarity (:py:class:`.Michelet`)
- Millar distance (:py:class:`.Millar`)
- Minkowski distance (:py:class:`.Minkowski`)
- Mountford similarity (:py:class:`.Mountford`)
- Mutual Information similarity (:py:class:`.MutualInformation`)
@@ -189,6 +198,7 @@
- Pearson's Phi correlation (:py:class:`.PearsonPhi`)
- Peirce correlation (:py:class:`.Peirce`)
- q-gram distance (:py:class:`.QGram`)
- Raup-Crick similarity (:py:class:`.RaupCrick`)
- Rogers & Tanimoto similarity (:py:class:`.RogersTanimoto`)
- Rogot & Goldberg similarity (:py:class:`.RogotGoldberg`)
- Russell & Rao similarity (:py:class:`.RussellRao`)
@@ -333,6 +343,7 @@
- Guth (:py:class:`.Guth`)
- Victorian Panel Study (:py:class:`.VPS`)
- LIG3 (:py:class:`.LIG3`)
- String subsequence kernel (SSK) (:py:class:`.SSK`)
Most of the distance and similarity measures have ``sim`` and ``dist`` methods,
which return a measure that is normalized to the range :math:`[0, 1]`. The
@@ -399,8 +410,12 @@
from ._brainerd_robinson import BrainerdRobinson
from ._braun_blanquet import BraunBlanquet
from ._canberra import Canberra
from ._cao import Cao
from ._chao_dice import ChaoDice
from ._chao_jaccard import ChaoJaccard
from ._chebyshev import Chebyshev, chebyshev
from ._chord import Chord
from ._clark import Clark
from ._clement import Clement
from ._cohen_kappa import CohenKappa
from ._cole import Cole
@@ -468,7 +483,9 @@
from ._hassanat import Hassanat
from ._hawkins_dotson import HawkinsDotson
from ._hellinger import Hellinger
from ._henderson_heron import HendersonHeron
from ._higuera_mico import HigueraMico
from ._horn_morisita import HornMorisita
from ._hurlbert import Hurlbert
from ._ident import Ident, dist_ident, sim_ident
from ._inclusion import Inclusion
@@ -524,10 +541,12 @@
from ._mcewen_michael import McEwenMichael
from ._meta_levenshtein import MetaLevenshtein
from ._michelet import Michelet
from ._millar import Millar
from ._minhash import MinHash
from ._minkowski import Minkowski, dist_minkowski, minkowski, sim_minkowski
from ._mlipns import MLIPNS, dist_mlipns, sim_mlipns
from ._monge_elkan import MongeElkan, dist_monge_elkan, sim_monge_elkan
from ._morisita import Morisita
from ._mountford import Mountford
from ._mra import MRA, dist_mra, mra_compare, sim_mra
from ._ms_contingency import MSContingency
@@ -551,6 +570,7 @@
from ._pearson_phi import PearsonPhi
from ._peirce import Peirce
from ._phonetic_distance import PhoneticDistance
from ._phonetic_edit_distance import PhoneticEditDistance
from ._positional_q_gram_dice import PositionalQGramDice
from ._positional_q_gram_jaccard import PositionalQGramJaccard
from ._positional_q_gram_overlap import PositionalQGramOverlap
@@ -564,6 +584,7 @@
dist_ratcliff_obershelp,
sim_ratcliff_obershelp,
)
from ._raup_crick import RaupCrick
from ._rees_levenshtein import ReesLevenshtein
from ._relaxed_hamming import RelaxedHamming
from ._roberts import Roberts
@@ -593,6 +614,7 @@
from ._sokal_sneath_iv import SokalSneathIV
from ._sokal_sneath_v import SokalSneathV
from ._sorgenfrei import Sorgenfrei
from ._ssk import SSK
from ._steffensen import Steffensen
from ._stiles import Stiles
from ._strcmp95 import Strcmp95, dist_strcmp95, sim_strcmp95
@@ -670,6 +692,7 @@
'FlexMetric',
'BISIM',
'DiscountedLevenshtein',
'PhoneticEditDistance',
'Hamming',
'hamming',
'dist_hamming',
@@ -715,9 +738,13 @@
'Bennet',
'BraunBlanquet',
'Canberra',
'Cao',
'ChaoDice',
'ChaoJaccard',
'Chebyshev',
'chebyshev',
'Chord',
'Clark',
'Clement',
'CohenKappa',
'Cole',
@@ -771,6 +798,8 @@
'Hassanat',
'HawkinsDotson',
'Hellinger',
'HendersonHeron',
'HornMorisita',
'Hurlbert',
'Jaccard',
'dist_jaccard',
@@ -800,11 +829,13 @@
'KulczynskiII',
'Lorentzian',
'Maarel',
'Morisita',
'Manhattan',
'manhattan',
'dist_manhattan',
'sim_manhattan',
'Michelet',
'Millar',
'Minkowski',
'minkowski',
'dist_minkowski',
@@ -828,6 +859,7 @@
'PearsonPhi',
'Peirce',
'QGram',
'RaupCrick',
'ReesLevenshtein',
'RogersTanimoto',
'RogotGoldberg',
@@ -1001,6 +1033,7 @@
'Guth',
'VPS',
'LIG3',
'SSK',
]


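As the module docstring notes, the distance and similarity measures share
``sim``/``dist`` methods normalized to [0, 1], and the classes added here
follow the same interface. A quick sketch with two of the additions (no
specific values asserted, since they depend on the implementations):

    >>> from abydos.distance import PhoneticEditDistance, SSK
    >>> 0.0 <= PhoneticEditDistance().dist('Smith', 'Smyth') <= 1.0
    True
    >>> 0.0 <= SSK().sim('cat', 'hat') <= 1.0
    True
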
51 changes: 45 additions & 6 deletions abydos/distance/_aline.py
@@ -1241,7 +1241,44 @@ def __init__(
        self._phones = self.phones_kondrak
        self._normalizer = normalizer

    def alignment(self, src, tar, score_only=False):
    def alignment(self, src, tar):
        """Return the top ALINE alignment of two strings.

        The `top` ALINE alignment is the first alignment with the best score.
        The purpose of this function is to have a single tuple as a return
        value.

        Parameters
        ----------
        src : str
            Source string for comparison
        tar : str
            Target string for comparison

        Returns
        -------
        tuple(float, str, str)
            ALINE alignment and its score

        Examples
        --------
        >>> cmp = ALINE()
        >>> cmp.alignment('cat', 'hat')
        (50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')
        >>> cmp.alignment('niall', 'neil')
        (90.0, '‖ n i a ll ‖', '‖ n e i l ‖')
        >>> cmp.alignment('aluminum', 'catalan')
        (81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')
        >>> cmp.alignment('atcg', 'tagc')
        (65.0, '‖ a t c ‖ g', 't ‖ a g c ‖')

        .. versionadded:: 0.4.1

        """
        return self.alignments(src, tar)[0]

    def alignments(self, src, tar, score_only=False):
        """Return the ALINE alignments of two strings.

        Parameters
Expand All @@ -1261,18 +1298,20 @@ def alignment(self, src, tar, score_only=False):
        Examples
        --------
        >>> cmp = ALINE()
        >>> cmp.alignment('cat', 'hat')
        >>> cmp.alignments('cat', 'hat')
        [(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')]
        >>> cmp.alignment('niall', 'neil')
        >>> cmp.alignments('niall', 'neil')
        [(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')]
        >>> cmp.alignment('aluminum', 'catalan')
        >>> cmp.alignments('aluminum', 'catalan')
        [(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')]
        >>> cmp.alignment('atcg', 'tagc')
        >>> cmp.alignments('atcg', 'tagc')
        [(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖'), (65.0, 'a ‖ tc - g ‖',
        '‖ t a g ‖ c')]

        .. versionadded:: 0.4.0
        .. versionchanged:: 0.4.1
            Renamed from .alignment to .alignments

        """

@@ -1615,7 +1654,7 @@ def sim_score(self, src, tar):
        """
        if src == '' and tar == '':
            return 1.0
        return self.alignment(src, tar, score_only=True)
        return self.alignments(src, tar, score_only=True)

    def sim(self, src, tar):
        """Return the normalized ALINE similarity of two strings.
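
Per the new docstring, ``alignment`` now just returns the first
(top-scoring) entry of ``alignments``, so the two methods relate as in
this quick sketch (not code from the commit):

    >>> from abydos.distance import ALINE
    >>> cmp = ALINE()
    >>> cmp.alignment('cat', 'hat') == cmp.alignments('cat', 'hat')[0]
    True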