This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Callable metric function for custom distances #11

Closed
tejasshah93 opened this issue Mar 10, 2017 · 15 comments


@tejasshah93

Hi,

PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.

E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:

metric : string or callable
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

Do we have such a provision for custom distances in PySparNN?
Thanks.
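For illustration, the callable-metric contract sklearn describes can be sketched without sklearn at all: a function taking two 1-D arrays and returning one value, evaluated per pair. The `manhattan` and `brute_force_nn` names below are hypothetical, made up for this sketch:

```python
import numpy as np

def manhattan(a, b):
    # A "callable metric" per the sklearn contract: two 1-D arrays in,
    # one scalar distance out.
    return float(np.abs(a - b).sum())

def brute_force_nn(query, records, metric):
    # Call the metric on each (query, record) pair; return the closest index.
    dists = [metric(query, row) for row in records]
    return int(np.argmin(dists)), min(dists)

records = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
idx, dist = brute_force_nn(np.array([0.9, 1.2]), records, manhattan)
# idx == 1: [1.0, 1.0] is nearest under the manhattan metric
```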

@spencebeecher
Contributor

spencebeecher commented Mar 11, 2017

Hi @tejasshah93 - Great question!

There isn't anything like that out of the box. However, it is pretty easy to implement your own custom distance metrics. By default, pysparnn uses sparse matrices + cosine similarity. This is where the library really shines. I am not aware of many Python-based libraries that do sparse search accurately and efficiently (https://github.com/ekzhu/datasketch and https://github.com/src-d/minhashcuda both look good).

However here is an example (already in the code) of using an alternative distance. https://github.com/facebookresearch/pysparnn/blob/master/examples/dense_matrix.ipynb

Here is an example (that should mostly work) that will let you use custom distances. scipy.spatial.distance.cdist provides a variety of distance types. Caution - I have found these distance functions to be much slower than what is currently in pysparnn. This is probably reasonable for testing, but writing customized & efficient code is probably the best way forward. More examples of custom distances in pysparnn can be found here: https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py

import numpy as np
import scipy.sparse
import scipy.spatial.distance

class SlowEuclideanDistance(MatrixMetricSearch):
    def __init__(self, records_features, records_data):
        super(SlowEuclideanDistance, self).__init__(records_features,
                                                    records_data,
                                                    is_similarity=False)
        # cdist needs dense input, so densify the index matrix up front
        self.matrix = self.matrix.toarray()

    @staticmethod
    def _transform_value(v):
        return v

    @staticmethod
    def features_to_matrix(features):
        """
        Args:
            features: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return np.array(features, ndmin=2)

    @staticmethod
    def vstack(matrix_list):
        """
        Args:
            matrix_list: A list of matrices to be stacked.
        Returns:
            The stacked matrix.
        """
        return np.vstack(matrix_list)

    def _distance(self, a_matrix):
        """Euclidean distance"""
        # need to handle flipping argmin k to positive
        return scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
                a_matrix.toarray(),
                self.matrix, 'euclidean'))
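As a quick sanity check on the `cdist` call used above, here is a small self-contained sketch (the toy data is made up): one row per query, one column per index record, wrapped back into a sparse matrix exactly as the class does.

```python
import numpy as np
import scipy.sparse
import scipy.spatial.distance

# Toy stand-ins: 'records' plays the role of the densified self.matrix,
# 'queries' the sparse matrix passed in at search time.
records = np.array([[0.0, 0.0], [3.0, 4.0]])
queries = scipy.sparse.csr_matrix(np.array([[0.0, 0.0], [3.0, 0.0]]))

dist = scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
    queries.toarray(), records, 'euclidean'))

d = dist.toarray()
# d[0] = distances from (0,0): [0.0, 5.0]; d[1] = from (3,0): [3.0, 4.0]
```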

@spencebeecher
Contributor

https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py#L249

Any updates on this? If I don't hear back from you in a week I am going to close this out!

@tejasshah93
Author

tejasshah93 commented Mar 19, 2017

Hi @spencebeecher - Apologies for not getting back earlier. Missed the notification!
Thanks for the prompt and detailed response for the issue raised along with the other insights!

From what I understood, in order to define my own custom distance metric in pysparnn, I have to define a class UserCustomDistance (say), its constructor, and the methods shown in the snippet above / here, viz.
_transform_value, features_to_matrix, vstack, and
_distance, which defines the distance metric to be applied over the matrices.

Having written the above class and its methods, simply passing distance_type=pysparnn.matrix_distance.UserCustomDistance while calculating the nearest neighbors would do the job. Sounds right?

Let me know if I'm not headed in the right direction. Otherwise, you may close the issue for now, and I'll reopen it later for clarifications during implementation, if any. Thanks!

@spencebeecher
Contributor

That is totally it! Please let me know if you run into any issues or have suggestions for improvement! Thanks!

@tejasshah93
Author

Sure thing! 👍

@tejasshah93
Author

tejasshah93 commented Mar 22, 2017

Hi @spencebeecher,

As discussed, I tried implementing a custom distance class and its allied methods, using sklearn.metrics.pairwise.pairwise_distances in _distance() with a callable metric function as an argument to it.

The training dataset is a sparse matrix of size NxN (value of N is around 900,000)
(Machine configuration: 64 GB RAM, 12 cores)

However, it seems that the index creation itself is taking quite long to complete (4 hours and still running). I had a brief look at cluster_index.py but couldn't find a parameter to specify the number of workers/jobs for index creation.

Is this expected behavior when using sklearn.metrics.pairwise.pairwise_distances (as you mentioned earlier)? Or could you let me know the time complexity of indexing with respect to the size of the matrix?

Thanks!

@spencebeecher
Contributor

spencebeecher commented Mar 22, 2017 via email

@tejasshah93
Author

Hi @spencebeecher

Indeed. I ran the code against a small dummy dataset to validate the nearest neighbors and the custom distance values - it seems to be working as expected! This gist contains the custom distance class (a value was modified for testing purposes). After appending this class to matrix_distance.py, I updated the bindings by executing $ python setup.py install --user from the repository folder.

For reproducing the experiment on your end, you can create a random sparse matrix of NxN (N=900000) and run

>>> cp = ci.MultiClusterIndex(dataset, data_to_return, distance_type=pysparnn.matrix_distance.UserCustomDistance)

where data_to_return = range(dataset.shape[0])

Let me know if anything doesn't seem fine or otherwise. Thanks!
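For reference, the random sparse matrix for that reproduction can be built with `scipy.sparse.random`; a sketch at a much smaller N (the MultiClusterIndex call itself requires pysparnn, so it is left as a comment here):

```python
import scipy.sparse

N = 1000  # the experiment above used N = 900000
# density = 10.0 / N targets roughly 10 nonzero elements per record
dataset = scipy.sparse.random(N, N, density=10.0 / N, format='csr')
data_to_return = range(dataset.shape[0])

# cp = ci.MultiClusterIndex(dataset, data_to_return,
#                           distance_type=pysparnn.matrix_distance.UserCustomDistance)
```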

@spencebeecher
Contributor

spencebeecher commented Mar 23, 2017 via email

@tejasshah93
Author

Appreciate your prompt response @spencebeecher!
Appreciate your prompt response @spencebeecher!
For diagnosis, I'd like to let you know that the index creation step still hasn't completed (> 24 hrs and still running). So yes, your suspicion regarding sklearn/scipy distance metric performance is probably right.

Sure, let me know when you're done with the implementation (or notify me if I've erred somewhere while writing the distance class).

Thanks!

@spencebeecher spencebeecher reopened this Mar 24, 2017
@spencebeecher
Contributor

My solution may be faster - I tested the new implementation on sample data and was able to get the same results. I am currently running this on my laptop, which does not have much power. I'll let you know how long it took to index. Note that I did make the matrix rather sparse (only 10 elements per record on average).

@tejasshah93
Author

Hi,

I'm not entirely sure what you're attempting when defining the distance metric here. I guess that's why the results differ for the sample matrices mentioned in the gist. (How did you get the same results?)

In [6]: u_dist = UserCustomDistance(A, range(A.shape[0]))
    ...: u_dist._distance(A)
Out[6]: 
array([[-2.,  1.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  1.,  0.]])

In [7]: su_dist = SpencerUserCustomDistance(A, range(A.shape[0]))
    ...: su_dist._distance(A)
Out[7]: 
array([[-4,  2,  0],
       [-1,  0,  0],
       [-2,  1,  0]])

For clarification, the custom distance value is computed as follows:
user_distance_metric() takes np.minimum over the data of the two sparse vectors being compared, then subtracts the sum of the resulting array from self.max_overlap.
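Taken literally, that description corresponds to something like the following sketch (not the gist's exact code; the constant and toy values here are illustrative):

```python
import numpy as np

MAX_OVERLAP = 3.0  # illustrative constant standing in for self.max_overlap

def user_distance_metric(a, b):
    # Elementwise minimum of the two vectors = their overlap;
    # distance = how far the overlap falls short of MAX_OVERLAP.
    return MAX_OVERLAP - np.minimum(a, b).sum()

a = np.array([1.0, 0.0, 2.0])
b = np.array([1.0, 1.0, 0.0])
d = user_distance_metric(a, b)
# np.minimum(a, b) == [1, 0, 0], overlap == 1, so d == 2.0
```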


However, even when I proceeded to create the index as mentioned in the gist, it aborted with the following error:
. .
/home/tejas/.local/lib/python2.7/site-packages/pysparnn/cluster_index.pyc in __init__(self, features, records_data, distance_type, matrix_size, parent)
    135                 max_rng = min(rng + rng_step, features.shape[0])
    136                 records_rng = features[rng:max_rng]
--> 137                 for i, clstrs in enumerate(root.nearest_search(records_rng, k=1)):
    138                     for _, cluster in clstrs:
    139                         item_to_clusters[cluster].append(i + rng)

/home/tejas/.local/lib/python2.7/site-packages/pysparnn/matrix_distance.pyc in nearest_search(self, features, k, max_distance)
    102         """
    103 
--> 104         dist_matrix = self._distance(features)
    105 
    106         if max_distance is None:

<ipython-input-2-9ac6b3897d5b> in _distance(self, a_matrix)
     69 
     70     def _distance(self, a_matrix):
---> 71         return np.array(self.matrix.sum(axis=1) - self.matrix.dot(a_matrix.transpose()).transpose())
     72 
     73 

/home/tejas/.local/lib/python2.7/site-packages/scipy/sparse/base.pyc in __mul__(self, other)
    350         if issparse(other):
    351             if self.shape[1] != other.shape[0]:
--> 352                 raise ValueError('dimension mismatch')
    353             return self._mul_sparse_matrix(other)
    354 

ValueError: dimension mismatch

Any pointers?

@spencebeecher
Contributor

spencebeecher commented Mar 27, 2017

Whoops - you are totally right! I had a bug that I must have introduced. Try this code. I have only tried it on 100k records, not 1 million.

https://gist.github.com/spencebeecher/04e79631e9725ea25889a53117319412

(same as the gist just with notebook output) tej_notebook.pdf

@tejasshah93
Author

tejasshah93 commented Mar 29, 2017

Hi @spencebeecher ,

I tried the modified code above on 1 million records and it executed in just about 225 seconds! (matrix.sum(axis=1).mean() returned 99.997115170633236). That's really fast by comparison!

As you mentioned, the results are the same for the sample matrices this time. However, when I checked the output for the sample matrices from the previous gist, it gives different results: I'm not sure of the approach used to define the custom distance in SpencerUserCustomDistance. Maybe have a glance at it?

Thanks!

@spencebeecher
Contributor

It was a bug (kinda)!

I had 'max_overlap' specified as a per-record value, matrix.sum(axis=1). This means that the max_distance would vary by record in the index. Setting max_overlap to matrix.shape[0] fixed the issue.
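A toy illustration of the difference (made-up numbers, not from the gist): with a per-record max_overlap the offset varies across index records, which can distort the ranking even when the actual overlaps are identical, whereas a single global constant shifts every record equally:

```python
import numpy as np

records = np.array([[5.0, 5.0], [1.0, 1.0]])  # two index records
query = np.array([1.0, 1.0])

# Both records overlap the query by exactly 2.0
overlap = np.minimum(records, query).sum(axis=1)

# Buggy: per-record max_overlap = records.sum(axis=1) -> offsets [10, 2],
# so record 1 looks much "closer" despite identical overlaps
per_record = records.sum(axis=1) - overlap   # [8.0, 0.0]

# Fixed: one global constant -> both records tie, as they should
global_max = 10.0
fixed = global_max - overlap                 # [8.0, 8.0]
```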

Old and new
