Merge 6ccc9dc into 2ce74a0
aparrish committed Jan 13, 2020
2 parents 2ce74a0 + 6ccc9dc commit fae5c78
Showing 15 changed files with 479 additions and 109 deletions.
12 changes: 5 additions & 7 deletions .travis.yml
@@ -2,8 +2,8 @@

language: python
python:
- "pypy3"
- "pypy"
- "3.8"
- "3.7"
- "3.6"
- "3.5"
- "3.4"
@@ -14,12 +14,12 @@ sudo: false

# command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
install:
- pip install -r requirements.txt
- pip install -e .[dev]
- pip install coverage

# command to run tests, e.g. python setup.py test
script:
- coverage run --source simpleneighbors setup.py test --verbose
- coverage run --source simpleneighbors tests/test_simpleneighbors.py --verbose
- python -m doctest simpleneighbors/__init__.py

after_success:
@@ -28,6 +28,4 @@ after_success:

after_script:
- coverage report # show coverage on cmd line
- pip install pycodestyle pyflakes
- pyflakes . | tee >(wc -l) # static analysis
- pycodestyle --statistics --count . # static analysis
- flake8 simpleneighbors tests
5 changes: 0 additions & 5 deletions Makefile
@@ -7,7 +7,6 @@ help:
@echo "clean-test - remove test and coverage artifacts"
@echo "lint - check style with flake8"
@echo "test - run tests quickly with the default Python"
@echo "test-all - run tests on every Python version with tox"
@echo "coverage - check code coverage quickly with the default Python"
@echo "docs - generate Sphinx HTML documentation, including API docs"
@echo "release - package and upload a release"
@@ -30,7 +29,6 @@ clean-pyc:
find . -name '__pycache__' -exec rm -fr {} +

clean-test:
rm -fr .tox/
rm -f .coverage
rm -fr htmlcov/

@@ -41,9 +39,6 @@ test:
python setup.py test
python -m doctest simpleneighbors/__init__.py

test-all:
tox

coverage:
coverage run --source simpleneighbors setup.py test
coverage report -m
80 changes: 61 additions & 19 deletions README.rst
@@ -11,8 +11,12 @@ Simple Neighbors
:target: https://pypi.python.org/pypi/simpleneighbors

Simple Neighbors is a clean and easy interface for performing nearest-neighbor
lookups on items from a corpus. For example, here's how to find the most
similar color to a color in the `xkcd colors list
lookups on items from a corpus. To install the package::

pip install simpleneighbors[annoy]

Here's a quick example, showing how to find the names of colors most similar to
'pink' in the `xkcd colors list
<https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json>`_::

>>> from simpleneighbors import SimpleNeighbors
@@ -26,7 +30,16 @@ similar color to a color in the `xkcd colors list
>>> list(sim.neighbors('pink', 5))
['pink', 'bubblegum pink', 'pale magenta', 'dark mauve', 'light plum']

Read the documentation here: https://simpleneighbors.readthedocs.org.
For a more complete example, refer to my `Understanding Word Vectors notebook
<https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb>`_,
which shows how to use Simple Neighbors to perform similarity lookups on word
vectors.

Read the complete Simple Neighbors documentation here:
https://simpleneighbors.readthedocs.org.

Why Simple Neighbors?
---------------------

Approximate nearest-neighbor lookups are a quick way to find the items in your
data set that are closest (or most similar to) any other item in your data, or
@@ -36,28 +49,57 @@ in a 300-dimensional space.

You could always perform pairwise distance calculations to find nearest
neighbors in your data, but for data of any appreciable size and complexity,
this kind of calculation is unbearably slow. This library uses `Annoy
<https://pypi.org/project/annoy/>`_ behind the scenes for approximate
nearest-neighbor lookups, which are ultimately a little less accurate than
pairwise calculations but much, much faster.
this kind of calculation is unbearably slow. Simple Neighbors uses one of a
handful of libraries behind the scenes to provide approximate nearest-neighbor
lookups, which are ultimately a little less accurate than pairwise calculations
but much, much faster.

The library also keeps track of your data, sparing you the extra step of
mapping each item in your data to its integer index in Annoy (at the potential
cost of some redundancy in data storage, depending on your application).
mapping each item in your data to its integer index (at the potential cost of
some redundancy in data storage, depending on your application).

I made Simple Neighbors because I use nearest neighbor lookups all the time and
found myself writing and rewriting the same bits of wrapper code over and over
again. I wanted to hide a little bit of the complexity of using these libraries
to make it easier to build small prototypes and teach workshops using
nearest-neighbor lookups.

Multiple backend support
------------------------

Simple Neighbors relies on the approximate nearest neighbor index
implementations found in other libraries. By default, Simple Neighbors will
choose the best backend based on the packages installed in your environment.
(You can also specify which backend to use by hand, or create your own.)

Currently supported backend libraries include:

* ``Annoy``: Erik Bernhardsson's `Annoy <https://pypi.org/project/annoy/>`_ library
* ``Sklearn``: `scikit-learn's NearestNeighbors <https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors>`_
* ``BruteForcePurePython``: Pure Python brute-force search (included in package)

When you install Simple Neighbors, you can direct ``pip`` to install the
required packages for a given backend. For example, to install Simple Neighbors
with Annoy::

pip install simpleneighbors[annoy]

Annoy is highly recommended! This is the preferred way to use Simple Neighbors.

I made Simple Neighbors because I use Annoy all the time and found myself
writing and rewriting the same bits of wrapper code over and over again. I
wanted to hide a little bit of the complexity of using Annoy to make it easier
to build small prototypes and teach workshops using nearest-neighbor lookups.
To install Simple Neighbors alongside scikit-learn to use the ``Sklearn``
backend (which makes use of scikit-learn's `NearestNeighbors` class)::

Installation
------------
pip install simpleneighbors[sklearn]

Install with pip like so::
If you can't install Annoy or scikit-learn on your platform, you can also use a
pure Python backend::

pip install simpleneighbors
pip install simpleneighbors[purepython]

You can also download the source code and install manually::
Note that the pure Python version uses a brute force search and is therefore
very slow. In general, it's not suitable for datasets with more than a few
thousand items (or more than a handful of dimensions).

python setup.py install
See the documentation for the ``SimpleNeighbors`` class for more information on
specifying backends.
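
The README text above says a backend is chosen automatically from whatever is installed. The commit does not show how `select_best` does this, so the following is only a minimal sketch of one way such a selection could work. The function name `pick_backend_name`, the `CANDIDATES` table, and the preference order (Annoy, then scikit-learn, then pure Python) are assumptions made for illustration, not the package's implementation.

```python
import importlib.util

# Hypothetical preference order, inferred from the README's recommendation
# of Annoy first. Each entry is (backend name, module that must be importable).
CANDIDATES = [
    ("Annoy", "annoy"),
    ("Sklearn", "sklearn"),
    ("BruteForcePurePython", None),  # needs no third-party module
]


def module_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend_name(candidates=CANDIDATES, is_available=module_installed):
    """Return the name of the first candidate whose module is available."""
    for name, module_name in candidates:
        if module_name is None or is_available(module_name):
            return name
    raise ImportError("no nearest-neighbor backend is available")
```

Passing `is_available` as a parameter keeps the sketch testable without installing or uninstalling anything: a caller can substitute a function that pretends only certain modules exist.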

1 change: 0 additions & 1 deletion requirements.txt

This file was deleted.

15 changes: 13 additions & 2 deletions setup.py
@@ -7,7 +7,7 @@

setup(
name='simpleneighbors',
version='0.0.1',
version='0.1.0',
author='Allison Parrish',
author_email='allison@decontextualize.com',
url='https://github.com/aparrish/simpleneighbors',
@@ -26,8 +26,19 @@
package_dir={'simpleneighbors': 'simpleneighbors'},
packages=['simpleneighbors'],
install_requires=[
'annoy'
],
extras_require={
'annoy': ['annoy>=1.16.0'],
'sklearn': ['scikit-learn>=0.20'],
'purepython': [],
'dev': [
'annoy>=1.16.0',
'scikit-learn>=0.20',
'mock;python_version<="2.7"',
'coverage',
'flake8',
]
},
platforms='any',
test_suite='tests'
)
89 changes: 55 additions & 34 deletions simpleneighbors/__init__.py
@@ -1,32 +1,44 @@
import pickle
import annoy
from simpleneighbors.backends import select_best

__author__ = 'Allison Parrish'
__email__ = 'allison@decontextualize.com'
__version__ = '0.0.1'
__version__ = '0.1.0'


class SimpleNeighbors:
"""A Simple Neighbors index.
You need to specify the number of dimensions in your data (i.e., the
length of the list or array you plan to provide for each item) and the
distance metric you want to use. (The default is "angular" distance,
i.e., cosine distance. You might also want to try "euclidean" for
Euclidean distance.) Both of these parameters are passed directly to
Annoy; see `the Annoy documentation <https://pypi.org/project/annoy/>`_
for more details.
This class wraps backend implementations of approximate nearest neighbors
indexes with a user-friendly API. When you instantiate this class, it will
automatically select a backend implementation based on packages installed
in your environment. It is HIGHLY RECOMMENDED that you install Annoy (``pip
install annoy``) to enable the Annoy backend! (The alternatives are
slower and not as accurate.) Alternatively, you can specify a backend of
your choosing with the ``backend`` parameter.
Specify the number of dimensions in your data (i.e., the length of the list
or array you plan to provide for each item) and the distance metric you
want to use. The default is ``angular`` distance, an approximation of
cosine distance. This metric is supported by all backends, as is
``euclidean`` (for Euclidean distance). Both of these parameters are passed
directly to the backend; see the backend documentation for more details.
:param dims: the number of dimensions in your data
:param metric: the distance metric to use
:param backend: the nearest neighbors backend to use (default is annoy)
"""

def __init__(self, dims, metric="angular"):
def __init__(self, dims, metric="angular", backend=None):

if backend is None:
backend = select_best()

self.dims = dims
self.metric = metric
self.id_map = {}
self.corpus = []
self.annoy = annoy.AnnoyIndex(dims, metric=metric)
self.backend = backend(dims, metric=metric)
self.i = 0
self.built = False

@@ -53,7 +65,7 @@ def add_one(self, item, vector):
"""

assert self.built is False, "Index already built; can't add new items."
self.annoy.add_item(self.i, vector)
self.backend.add_item(self.i, vector)
self.id_map[item] = self.i
self.corpus.append(item)
self.i += 1
@@ -88,20 +100,25 @@ def feed(self, items):
for item, vector in items:
self.add_one(item, vector)

def build(self, n=10):
def build(self, n=10, params=None):
"""Build the index.
After adding all of your items, call this method to build
the index. The specified parameter controls the number of trees in the
underlying Annoy index; a higher number will take longer to build but
provide more precision when querying.
After adding all of your items, call this method to build the index.
The meaning of parameter ``n`` is different for each backend
implementation. For the Annoy backend, it specifies the number of trees
in the underlying Annoy index (a higher number will take longer to
build but provide more precision when querying). For the Sklearn
backend, the number specifies the leaf size when building the ball
tree. (The Brute Force Pure Python backend ignores this value
entirely.)
After you call build, you'll no longer be able to add new items to the
index.
:param n: number of trees
:param n: backend-dependent (for Annoy: number of trees)
:param params: dictionary with extra parameters to pass to backend
"""
self.annoy.build(n)
self.backend.build(n, params)
self.built = True

def nearest(self, vec, n=12):
@@ -130,7 +147,7 @@ def nearest(self, vec, n=12):
"""

return [self.corpus[idx] for idx
in self.annoy.get_nns_by_vector(vec, n)]
in self.backend.get_nns_by_vector(vec, n)]

def neighbors(self, item, n=12):
"""Returns the items nearest another item in the index.
@@ -234,10 +251,10 @@ def dist(self, a, b):
:param b: second item
:returns: distance between ``a`` and ``b``
"""
return self.annoy.get_distance(self.id_map[a], self.id_map[b])
return self.backend.get_distance(self.id_map[a], self.id_map[b])

def vec(self, item):
"""Returns the vector for an item
"""Returns the vector for an item.
This method returns the vector that was originally provided when
indexing the specified item. (Depending on how it was originally
@@ -247,7 +264,7 @@ def vec(self, item):
:param item: item to lookup
:returns: vector for item
"""
return self.annoy.get_item_vector(self.id_map[item])
return self.backend.get_item_vector(self.id_map[item])

def __len__(self):
"""Returns the number of items in the vector"""
@@ -256,12 +273,14 @@ def __len__(self):
def save(self, prefix):
"""Saves the index to disk.
This method saves the index to disk. Annoy indexes can't be serialized
with `pickle`, so this method produces two files: the serialized Annoy
index, and a pickle with the other data from the object. This method's
parameter specifies the "prefix" to use for these files. The Annoy
index will be saved as ``<prefix>.annoy`` and the object data will be
saved as ``<prefix>-data.pkl``.
This method saves the index to disk. Each backend manages serialization
a little bit differently: consult the documentation and source code for
more details. For example, because Annoy indexes can't be serialized
with `pickle`, the Annoy backend's implementation produces two files:
the serialized Annoy index, and a pickle with the other data from the
object.
This method's parameter specifies the "prefix" to use for these files.
:param prefix: filename prefix for Annoy index and object data
:returns: None
@@ -275,9 +294,10 @@ def save(self, prefix):
'i': self.i,
'built': self.built,
'metric': self.metric,
'dims': self.dims
'dims': self.dims,
'_backend_class': self.backend.__class__
}, fh)
self.annoy.save(prefix + ".annoy")
self.backend.save(prefix + ".idx")

@classmethod
def load(cls, prefix):
@@ -286,19 +306,20 @@ def load(cls, prefix):
This class method restores a previously-saved index using the specified
file prefix.
:param prefix: prefix for AnnoyIndex file and object data pickle
:param prefix: prefix used when saving
:returns: SimpleNeighbors object restored from specified files
"""

with open(prefix + "-data.pkl", "rb") as fh:
data = pickle.load(fh)
newobj = cls(
dims=data['dims'],
metric=data['metric']
metric=data['metric'],
backend=data['_backend_class']
)
newobj.id_map = data['id_map']
newobj.corpus = data['corpus']
newobj.i = data['i']
newobj.built = data['built']
newobj.annoy.load(prefix + ".annoy")
newobj.backend.load(prefix + ".idx")
return newobj
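
After this change, `SimpleNeighbors` only talks to its backend through a small set of calls visible in the diff: the constructor `backend(dims, metric=metric)`, then `add_item`, `build`, `get_nns_by_vector`, `get_distance`, `get_item_vector`, `save`, and `load`. The backends module itself is not part of this diff, so the class below is a hypothetical sketch of an object that satisfies those calls with a brute-force search. The name `ToyBruteForceBackend` is invented here, and the "angular" formula (the Euclidean distance between normalized vectors) is an assumption modeled on how Annoy documents its angular metric. It is not the package's `BruteForcePurePython` backend.

```python
import math
import pickle


class ToyBruteForceBackend:
    """Hypothetical backend exposing the methods SimpleNeighbors calls."""

    def __init__(self, dims, metric="angular"):
        if metric not in ("angular", "euclidean"):
            raise ValueError("unsupported metric: %r" % (metric,))
        self.dims = dims
        self.metric = metric
        self.vectors = []

    def add_item(self, i, vector):
        # SimpleNeighbors assigns indexes 0, 1, 2, ... in insertion order.
        if i != len(self.vectors):
            raise ValueError("items must be added in index order")
        if len(vector) != self.dims:
            raise ValueError("vector has the wrong number of dimensions")
        self.vectors.append(list(vector))

    def build(self, n=10, params=None):
        # Brute-force search needs no index structure, so n is ignored.
        pass

    def _distance(self, u, v):
        if self.metric == "euclidean":
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
        # Assumed "angular" definition: sqrt(2 * (1 - cosine similarity)).
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        cos = dot / norms if norms else 0.0
        return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

    def get_nns_by_vector(self, vec, n):
        # Rank every stored vector by distance and keep the n closest indexes.
        order = sorted(range(len(self.vectors)),
                       key=lambda i: self._distance(vec, self.vectors[i]))
        return order[:n]

    def get_distance(self, a, b):
        return self._distance(self.vectors[a], self.vectors[b])

    def get_item_vector(self, i):
        return self.vectors[i]

    def save(self, fn):
        with open(fn, "wb") as fh:
            pickle.dump(self.vectors, fh)

    def load(self, fn):
        with open(fn, "rb") as fh:
            self.vectors = pickle.load(fh)
```

Because `__init__` in the diff calls `backend(dims, metric=metric)`, a class like this would be passed uninstantiated, for example `SimpleNeighbors(2, metric="euclidean", backend=ToyBruteForceBackend)`, assuming the package places no further requirements on backend classes.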
