Merge 6ccc9dc into 2ce74a0

aparrish · Jan 13, 2020 · fae5c78
2 parents 2ce74a0 + 6ccc9dc
commit fae5c78

Showing 15 changed files with 479 additions and 109 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -2,8 +2,8 @@
 
 language: python
 python:
-  - "pypy3"
-  - "pypy"
+  - "3.8"
+  - "3.7"
   - "3.6"
   - "3.5"
   - "3.4"
@@ -14,12 +14,12 @@ sudo: false
 
 # command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
 install:
-  - pip install -r requirements.txt
+  - pip install -e .[dev]
   - pip install coverage
 
 # command to run tests, e.g. python setup.py test
 script:
-  - coverage run --source simpleneighbors setup.py test --verbose
+  - coverage run --source simpleneighbors tests/test_simpleneighbors.py --verbose
   - python -m doctest simpleneighbors/__init__.py
 
 after_success:
@@ -28,6 +28,4 @@ after_success:
 
 after_script:
 - coverage report                     # show coverage on cmd line
-- pip install pycodestyle pyflakes
-- pyflakes . | tee >(wc -l)           # static analysis
-- pycodestyle --statistics --count .  # static analysis
+- flake8 simpleneighbors tests
diff --git a/Makefile b/Makefile
@@ -7,7 +7,6 @@ help:
 	@echo "clean-test - remove test and coverage artifacts"
 	@echo "lint - check style with flake8"
 	@echo "test - run tests quickly with the default Python"
-	@echo "test-all - run tests on every Python version with tox"
 	@echo "coverage - check code coverage quickly with the default Python"
 	@echo "docs - generate Sphinx HTML documentation, including API docs"
 	@echo "release - package and upload a release"
@@ -30,7 +29,6 @@ clean-pyc:
 	find . -name '__pycache__' -exec rm -fr {} +
 
 clean-test:
-	rm -fr .tox/
 	rm -f .coverage
 	rm -fr htmlcov/
 
@@ -41,9 +39,6 @@ test:
 	python setup.py test
 	python -m doctest simpleneighbors/__init__.py
 
-test-all:
-	tox
-
 coverage:
 	coverage run --source simpleneighbors setup.py test
 	coverage report -m

diff --git a/README.rst b/README.rst
@@ -11,8 +11,12 @@ Simple Neighbors
         :target: https://pypi.python.org/pypi/simpleneighbors
 
 Simple Neighbors is a clean and easy interface for performing nearest-neighbor
-lookups on items from a corpus. For example, here's how to find the most
-similar color to a color in the `xkcd colors list
+lookups on items from a corpus. To install the package::
+
+    pip install simpleneighbors[annoy]
+
+Here's a quick example, showing how to find the names of colors most similar to
+'pink' in the `xkcd colors list
 <https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json>`_::
 
     >>> from simpleneighbors import SimpleNeighbors
@@ -26,7 +30,16 @@ similar color to a color in the `xkcd colors list
     >>> list(sim.neighbors('pink', 5))
     ['pink', 'bubblegum pink', 'pale magenta', 'dark mauve', 'light plum']
 
-Read the documentation here: https://simpleneighbors.readthedocs.org.
+For a more complete example, refer to my `Understanding Word Vectors notebook
+<https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb>`_,
+which shows how to use Simple Neighbors to perform similarity lookups on word
+vectors.
+
+Read the complete Simple Neighbors documentation here:
+https://simpleneighbors.readthedocs.org.
+
+Why Simple Neighbors?
+---------------------
 
 Approximate nearest-neighbor lookups are a quick way to find the items in your
 data set that are closest (or most similar to) any other item in your data, or
@@ -36,28 +49,57 @@ in a 300-dimensional space.
 
 You could always perform pairwise distance calculations to find nearest
 neighbors in your data, but for data of any appreciable size and complexity,
-this kind of calculation is unbearably slow. This library uses `Annoy
-<https://pypi.org/project/annoy/>`_ behind the scenes for approximate
-nearest-neighbor lookups, which are ultimately a little less accurate than
-pairwise calculations but much, much faster.
+this kind of calculation is unbearably slow. Simple Neighbors uses one of a
+handful of libraries behind the scenes to provide approximate nearest-neighbor
+lookups, which are ultimately a little less accurate than pairwise calculations
+but much, much faster.
 
 The library also keeps track of your data, sparing you the extra step of
-mapping each item in your data to its integer index in Annoy (at the potential
-cost of some redundancy in data storage, depending on your application).
+mapping each item in your data to its integer index (at the potential cost of
+some redundancy in data storage, depending on your application).
+
+I made Simple Neighbors because I use nearest neighbor lookups all the time and
+found myself writing and rewriting the same bits of wrapper code over and over
+again. I wanted to hide a little bit of the complexity of using these libraries
+to make it easier to build small prototypes and teach workshops using
+nearest-neighbor lookups.
+
+Multiple backend support
+------------------------
+
+Simple Neighbors relies on the approximate nearest neighbor index
+implementations found in other libraries. By default, Simple Neighbors will
+choose the best backend based on the packages installed in your environment.
+(You can also specify which backend to use by hand, or create your own.)
+
+Currently supported backend libraries include:
+
+* ``Annoy``: Erik Bernhardsson's `Annoy <https://pypi.org/project/annoy/>`_ library
+* ``Sklearn``: `scikit-learn's NearestNeighbors <https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors>`_
+* ``BruteForcePurePython``: Pure Python brute-force search (included in package)
+
+When you install Simple Neighbors, you can direct ``pip`` to install the
+required packages for a given backend. For example, to install Simple Neighbors
+with Annoy::
+
+    pip install simpleneighbors[annoy]
+
+Annoy is highly recommended! This is the preferred way to use Simple Neighbors.
 
-I made Simple Neighbors because I use Annoy all the time and found myself
-writing and rewriting the same bits of wrapper code over and over again. I
-wanted to hide a little bit of the complexity of using Annoy to make it easier
-to build small prototypes and teach workshops using nearest-neighbor lookups.
+To install Simple Neighbors alongside scikit-learn to use the ``Sklearn``
+backend (which makes use of scikit-learn's ``NearestNeighbors`` class)::
 
-Installation
-------------
+    pip install simpleneighbors[sklearn]
 
-Install with pip like so::
+If you can't install Annoy or scikit-learn on your platform, you can also use a
+pure Python backend::
 
-    pip install simpleneighbors
+    pip install simpleneighbors[purepython]
 
-You can also download the source code and install manually::
+Note that the pure Python version uses a brute force search and is therefore
+very slow. In general, it's not suitable for datasets with more than a few
+thousand items (or more than a handful of dimensions).
 
-    python setup.py install
+See the documentation for the ``SimpleNeighbors`` class for more information on
+specifying backends.
 
diff --git a/requirements.txt b/requirements.txt
diff --git a/setup.py b/setup.py
@@ -7,7 +7,7 @@
 
 setup(
     name='simpleneighbors',
-    version='0.0.1',
+    version='0.1.0',
     author='Allison Parrish',
     author_email='allison@decontextualize.com',
     url='https://github.com/aparrish/simpleneighbors',
@@ -26,8 +26,19 @@
     package_dir={'simpleneighbors': 'simpleneighbors'},
     packages=['simpleneighbors'],
     install_requires=[
-        'annoy'
     ],
+    extras_require={
+        'annoy': ['annoy>=1.16.0'],
+        'sklearn': ['scikit-learn>=0.20'],
+        'purepython': [],
+        'dev': [
+            'annoy>=1.16.0',
+            'scikit-learn>=0.20',
+            'mock;python_version<="2.7"',
+            'coverage',
+            'flake8',
+        ]
+    },
     platforms='any',
     test_suite='tests'
 )
diff --git a/simpleneighbors/__init__.py b/simpleneighbors/__init__.py
@@ -1,32 +1,44 @@
 import pickle
-import annoy
+from simpleneighbors.backends import select_best
 
 __author__ = 'Allison Parrish'
 __email__ = 'allison@decontextualize.com'
-__version__ = '0.0.1'
+__version__ = '0.1.0'
 
 
 class SimpleNeighbors:
     """A Simple Neighbors index.
 
-    You need to specify the number of dimensions in your data (i.e., the
-    length of the list or array you plan to provide for each item) and the
-    distance metric you want to use. (The default is "angular" distance,
-    i.e., cosine distance. You might also want to try "euclidean" for
-    Euclidean distance.) Both of these parameters are passed directly to
-    Annoy; see `the Annoy documentation <https://pypi.org/project/annoy/>`_
-    for more details.
+    This class wraps backend implementations of approximate nearest neighbors
+    indexes with a user-friendly API. When you instantiate this class, it will
+    automatically select a backend implementation based on packages installed
+    in your environment. It is HIGHLY RECOMMENDED that you install Annoy (``pip
+    install annoy``) to enable the Annoy backend! (The alternatives are
+    slower and not as accurate.) Alternatively, you can specify a backend of
+    your choosing with the ``backend`` parameter.
+
+    Specify the number of dimensions in your data (i.e., the length of the list
+    or array you plan to provide for each item) and the distance metric you
+    want to use. The default is ``angular`` distance, an approximation of
+    cosine distance. This metric is supported by all backends, as is
+    ``euclidean`` (for Euclidean distance). Both of these parameters are passed
+    directly to the backend; see the backend documentation for more details.
 
     :param dims: the number of dimensions in your data
     :param metric: the distance metric to use
+    :param backend: backend class to use (default: chosen by ``select_best``)
     """
 
-    def __init__(self, dims, metric="angular"):
+    def __init__(self, dims, metric="angular", backend=None):
+
+        if backend is None:
+            backend = select_best()
+
         self.dims = dims
         self.metric = metric
         self.id_map = {}
         self.corpus = []
-        self.annoy = annoy.AnnoyIndex(dims, metric=metric)
+        self.backend = backend(dims, metric=metric)
         self.i = 0
         self.built = False
 
@@ -53,7 +65,7 @@ def add_one(self, item, vector):
         """
 
         assert self.built is False, "Index already built; can't add new items."
-        self.annoy.add_item(self.i, vector)
+        self.backend.add_item(self.i, vector)
         self.id_map[item] = self.i
         self.corpus.append(item)
         self.i += 1
@@ -88,20 +100,25 @@ def feed(self, items):
         for item, vector in items:
             self.add_one(item, vector)
 
-    def build(self, n=10):
+    def build(self, n=10, params=None):
         """Build the index.
 
-        After adding all of your items, call this method to build
-        the index. The specified parameter controls the number of trees in the
-        underlying Annoy index; a higher number will take longer to build but
-        provide more precision when querying.
+        After adding all of your items, call this method to build the index.
+        The meaning of parameter ``n`` is different for each backend
+        implementation. For the Annoy backend, it specifies the number of trees
+        in the underlying Annoy index (a higher number will take longer to
+        build but provide more precision when querying). For the Sklearn
+        backend, the number specifies the leaf size when building the ball
+        tree. (The Brute Force Pure Python backend ignores this value
+        entirely.)
 
         After you call build, you'll no longer be able to add new items to the
         index.
 
-        :param n: number of trees
+        :param n: backend-dependent (for Annoy: number of trees)
+        :param params: dictionary with extra parameters to pass to backend
         """
-        self.annoy.build(n)
+        self.backend.build(n, params)
         self.built = True
 
     def nearest(self, vec, n=12):
@@ -130,7 +147,7 @@ def nearest(self, vec, n=12):
         """
 
         return [self.corpus[idx] for idx
-                in self.annoy.get_nns_by_vector(vec, n)]
+                in self.backend.get_nns_by_vector(vec, n)]
 
     def neighbors(self, item, n=12):
         """Returns the items nearest another item in the index.
@@ -234,10 +251,10 @@ def dist(self, a, b):
         :param b: second item
         :returns: distance between ``a`` and ``b``
         """
-        return self.annoy.get_distance(self.id_map[a], self.id_map[b])
+        return self.backend.get_distance(self.id_map[a], self.id_map[b])
 
     def vec(self, item):
-        """Returns the vector for an item
+        """Returns the vector for an item.
 
         This method returns the vector that was originally provided when
         indexing the specified item. (Depending on how it was originally
@@ -247,7 +264,7 @@ def vec(self, item):
         :param item: item to lookup
         :returns: vector for item
         """
-        return self.annoy.get_item_vector(self.id_map[item])
+        return self.backend.get_item_vector(self.id_map[item])
 
     def __len__(self):
         """Returns the number of items in the vector"""
@@ -256,12 +273,14 @@ def __len__(self):
     def save(self, prefix):
         """Saves the index to disk.
 
-        This method saves the index to disk. Annoy indexes can't be serialized
-        with `pickle`, so this method produces two files: the serialized Annoy
-        index, and a pickle with the other data from the object. This method's
-        parameter specifies the "prefix" to use for these files. The Annoy
-        index will be saved as ``<prefix>.annoy`` and the object data will be
-        saved as ``<prefix>-data.pkl``.
+        This method saves the index to disk. Each backend manages serialization
+        a little bit differently: consult the documentation and source code for
+        more details. For example, because Annoy indexes can't be serialized
+        with `pickle`, the Annoy backend's implementation produces two files:
+        the serialized Annoy index, and a pickle with the other data from the
+        object.
+
+        This method's parameter specifies the "prefix" to use for these files.
 
         :param prefix: filename prefix for Annoy index and object data
         :returns: None
@@ -275,9 +294,10 @@ def save(self, prefix):
                 'i': self.i,
                 'built': self.built,
                 'metric': self.metric,
-                'dims': self.dims
+                'dims': self.dims,
+                '_backend_class': self.backend.__class__
             }, fh)
-        self.annoy.save(prefix + ".annoy")
+        self.backend.save(prefix + ".idx")
 
     @classmethod
     def load(cls, prefix):
@@ -286,19 +306,20 @@ def load(cls, prefix):
         This class method restores a previously-saved index using the specified
         file prefix.
 
-        :param prefix: prefix for AnnoyIndex file and object data pickle
+        :param prefix: prefix used when saving
         :returns: SimpleNeighbors object restored from specified files
         """
 
         with open(prefix + "-data.pkl", "rb") as fh:
             data = pickle.load(fh)
         newobj = cls(
             dims=data['dims'],
-            metric=data['metric']
+            metric=data['metric'],
+            backend=data['_backend_class']
         )
         newobj.id_map = data['id_map']
         newobj.corpus = data['corpus']
         newobj.i = data['i']
         newobj.built = data['built']
-        newobj.annoy.load(prefix + ".annoy")
+        newobj.backend.load(prefix + ".idx")
         return newobj
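
Note on custom backends: the README hunk above says you can "create your own" backend, and the calls in ``simpleneighbors/__init__.py`` show which methods ``SimpleNeighbors`` expects a backend object to provide (``add_item``, ``build``, ``get_nns_by_vector``, ``get_distance``, ``get_item_vector``, ``save``, ``load``). The following is a minimal sketch of a class with that shape, not code from this commit. The class name ``TinyBruteForceBackend`` is hypothetical, it supports only the ``euclidean`` metric, and the actual base class in ``simpleneighbors.backends`` (not shown in this section) may require more than this.

```python
import math
import pickle


class TinyBruteForceBackend:
    """Hypothetical backend sketch; not part of simpleneighbors."""

    def __init__(self, dims, metric="euclidean"):
        # SimpleNeighbors instantiates its backend as backend(dims, metric=metric).
        if metric != "euclidean":
            raise ValueError("this sketch only supports the 'euclidean' metric")
        self.dims = dims
        self.metric = metric
        self.vectors = {}  # integer index -> vector (list of floats)

    def add_item(self, idx, vector):
        vector = list(vector)
        if len(vector) != self.dims:
            raise ValueError("expected a vector of length %d" % self.dims)
        self.vectors[idx] = vector

    def build(self, n=10, params=None):
        # Brute-force search needs no index structure, so both arguments
        # are ignored here.
        pass

    @staticmethod
    def _dist(a, b):
        # Plain Euclidean distance between two equal-length sequences.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def get_nns_by_vector(self, vec, n):
        # Rank every stored index by its distance to vec; keep the n closest.
        ranked = sorted(self.vectors,
                        key=lambda i: self._dist(self.vectors[i], vec))
        return ranked[:n]

    def get_distance(self, a, b):
        return self._dist(self.vectors[a], self.vectors[b])

    def get_item_vector(self, idx):
        return self.vectors[idx]

    def save(self, fname):
        with open(fname, "wb") as fh:
            pickle.dump(self.vectors, fh)

    def load(self, fname):
        with open(fname, "rb") as fh:
            self.vectors = pickle.load(fh)
```

Given the constructor signature added in this commit, a class like this would presumably be passed as ``SimpleNeighbors(2, metric="euclidean", backend=TinyBruteForceBackend)``. Because ``save`` pickles ``self.backend.__class__``, the class would need to be importable at load time.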