This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Commit

init
spencebeecher committed Mar 28, 2016
0 parents commit 9cf5b0a
Showing 15 changed files with 767 additions and 0 deletions.
36 changes: 36 additions & 0 deletions CONTRIBUTING
@@ -0,0 +1,36 @@
# Contributing to PySparNN
We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style
* 2 spaces for indentation rather than tabs
* 80 character line length
* TODO: Finish THIS

## License
By contributing to PySparNN, you agree that your contributions will be licensed
under its BSD license.
30 changes: 30 additions & 0 deletions LICENSE.md
@@ -0,0 +1,30 @@
BSD License

For PySparNN software

Copyright (c) 2016-present, Facebook, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

* Neither the name Facebook nor the names of its contributors may be used to
endorse or promote products derived from this software without specific
prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
33 changes: 33 additions & 0 deletions PATENTS
@@ -0,0 +1,33 @@
Additional Grant of Patent Rights Version 2

"Software" means the PySparNN software distributed by Facebook, Inc.

Facebook, Inc. ("Facebook") hereby grants to each recipient of the Software
("you") a perpetual, worldwide, royalty-free, non-exclusive, irrevocable
(subject to the termination provision below) license under any Necessary
Claims, to make, have made, use, sell, offer to sell, import, and otherwise
transfer the Software. For avoidance of doubt, no license is granted under
Facebook’s rights in any patent claims that are infringed by (i) modifications
to the Software made by you or any third party or (ii) the Software in
combination with any software or other technology.

The license granted hereunder will terminate, automatically and without notice,
if you (or any of your subsidiaries, corporate affiliates or agents) initiate
directly or indirectly, or take a direct financial interest in, any Patent
Assertion: (i) against Facebook or any of its subsidiaries or corporate
affiliates, (ii) against any party if such Patent Assertion arises in whole or
in part from any software, technology, product or service of Facebook or any of
its subsidiaries or corporate affiliates, or (iii) against any party relating
to the Software. Notwithstanding the foregoing, if Facebook or any of its
subsidiaries or corporate affiliates files a lawsuit alleging patent
infringement against you in the first instance, and you respond by filing a
patent infringement counterclaim in that lawsuit against that party that is
unrelated to the Software, the license granted hereunder will not terminate
under section (i) of this paragraph due to such counterclaim.

A "Necessary Claim" is a claim of a patent owned by Facebook that is
necessarily infringed by the Software standing alone.

A "Patent Assertion" is any lawsuit or other action alleging direct, indirect,
or contributory infringement or inducement to infringe any patent, including a
cross-claim or counterclaim.
66 changes: 66 additions & 0 deletions README.md
@@ -0,0 +1,66 @@
Blockers:
* Finish contributing file
* pylint
* matrix vector multiply discussion

# PySparNN
Sparse (approximate) nearest neighbor search for Python! This library is well suited to finding nearest neighbors in sparse, high-dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Similarity.

PySparNN can be easily extended with arbitrary similarity metrics (Manhattan, Euclidean, Jaccard, etc).
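
To make the metrics mentioned above concrete, here is a minimal sketch of cosine and Jaccard similarity over sparse `{feature: weight}` dicts, the same shape as the feature dicts in the usage example. The function names `cosine_similarity` and `jaccard_similarity` are hypothetical helpers written for illustration; they are not part of PySparNN's API and do not show how the library plugs in a custom metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    # Iterate over the smaller dict so the dot product only visits
    # min(len(a), len(b)) entries.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: an empty vector is similar to nothing.
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between the feature sets (dict keys) of two sparse vectors."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return len(keys_a & keys_b) / len(union)


# Hypothetical sample documents, tokenized on whitespace.
hello_world = {'hello': 1, 'world': 1}
oh_hello_there = {'oh': 1, 'hello': 1, 'there': 1}

print(cosine_similarity(hello_world, oh_hello_there))
print(jaccard_similarity(hello_world, oh_hello_there))
```

Both functions only touch features that are actually present, which is why sparse dict (or sparse matrix) representations pay off in high-dimensional spaces such as text.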

If your data is NOT SPARSE & you don't require a custom distance function - please consider [annoy](https://github.com/spotify/annoy).
It uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.

The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~30% faster on the 20newsgroups dataset. [Here is the comparison.](https://github.com/facebook/PySparNN/blob/master/sparse_search_comparison.ipynb)

Notes:
* A future update may allow incremental insertions.

## Example Usage
```
import pysparnn as snn

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]

# Represent each document as a sparse {token: 1} feature dict.
features = [dict([(x, 1) for x in f.split()]) for f in data]

# Build the index; the entries of `data` are what search() returns.
cp = snn.ClusterIndex(features, data)

cp.search(features, threshold=0.50, k=1, return_similarity=False)
# >> [[u'hello world'], [u'oh hello there'], [u'Play it'], [u'Play it again Sam']]

cp.search(features, threshold=0.50, k_clusters=2, k=2, return_similarity=False)
# >> [[u'hello world'],
# >> [u'oh hello there'],
# >> [u'Play it', u'Play it again Sam'],
# >> [u'Play it again Sam', u'Play it']]
```

## Requirements
PySparNN requires numpy. Tested with numpy 1.10.4.

## How PySparNN works
Searching for a document in a collection of K documents is naively O(K) (assuming documents are constant sized).

However, we can create a two-level tree structure where the first level holds O(sqrt(K)) items and each of the leaves also holds O(sqrt(K)) items on average.

We randomly pick sqrt(K) items to be in the top level. Then each of the K documents is assigned to its closest neighbor in the top
level.

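This breaks up one O(K) search into two O(sqrt(K)) searches, which is much faster when K is big. For example, with K = 1,000,000 and roughly balanced leaves, a query needs about 2,000 comparisons instead of 1,000,000.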
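
The two-level scheme described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not PySparNN's actual implementation: the names `cosine`, `build_index`, and `search` are hypothetical, documents are assumed to be sparse `{token: weight}` dicts, and only the single closest cluster is searched.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse {feature: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_index(docs, seed=0):
    """Pick about sqrt(K) random leaders and attach every document to its closest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, int(math.sqrt(len(docs))))
    leaders = rng.sample(range(len(docs)), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for i, doc in enumerate(docs):
        closest = max(leaders, key=lambda l: cosine(doc, docs[l]))
        clusters[closest].append(i)
    return clusters


def search(docs, clusters, query, k=1):
    """Compare the query to the leaders, then only to the members of the best leader's cluster."""
    best_leader = max(clusters, key=lambda l: cosine(query, docs[l]))
    members = clusters[best_leader]
    ranked = sorted(members, key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]


# Hypothetical usage with the same toy documents as the usage example.
data = ['hello world', 'oh hello there', 'Play it', 'Play it again Sam']
docs = [{token: 1 for token in text.split()} for text in data]
clusters = build_index(docs)
print([data[i] for i in search(docs, clusters, docs[3], k=1)])
```

Because only one cluster is inspected, the result is approximate: a true nearest neighbor that was assigned to a different leader will be missed. Searching several of the closest clusters instead of one trades speed for recall.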

## Further Information
http://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

See the CONTRIBUTING file for how to help out.

## License
PySparNN is BSD-licensed. We also provide an additional patent grant.
