This repository has been archived by the owner on Aug 31, 2021. It is now read-only.

Callable metric function for custom distances #11

Closed
tejasshah93 opened this issue Mar 10, 2017 · 15 comments


@tejasshah93

Hi,

PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.

E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:

metric : string or callable
If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

Do we have such a provision for custom distances in PySparNN?
Thanks.
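For illustration, the callable-metric contract sklearn describes can be sketched without sklearn at all: a function taking two 1-D arrays and returning one value, evaluated per pair. The `manhattan` and `brute_force_nn` names below are hypothetical, made up for this sketch:

```python
import numpy as np

def manhattan(a, b):
    # A "callable metric" per the sklearn contract: two 1-D arrays in,
    # one scalar distance out.
    return float(np.abs(a - b).sum())

def brute_force_nn(query, records, metric):
    # Call the metric on each (query, record) pair; return the closest index.
    dists = [metric(query, row) for row in records]
    return int(np.argmin(dists)), min(dists)

records = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
idx, dist = brute_force_nn(np.array([0.9, 1.2]), records, manhattan)
# idx == 1: [1.0, 1.0] is nearest under the manhattan metric
```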

@spencebeecher
Contributor

spencebeecher commented Mar 11, 2017

Hi @tejasshah93 - Great question!

There isn't anything like that out of the box. However, it is pretty easy to implement your own custom distance metrics. By default, pysparnn uses sparse matrices + cosine similarity. This is where the library really shines. I am not aware of many Python-based libraries that do sparse search accurately and efficiently (https://github.com/ekzhu/datasketch and https://github.com/src-d/minhashcuda both look good).

However here is an example (already in the code) of using an alternative distance. https://github.com/facebookresearch/pysparnn/blob/master/examples/dense_matrix.ipynb

Here is an example (that should mostly work) that will let you use custom distances. scipy.spatial.distance.cdist provides a variety of distance types. Caution - I have found these distance functions to be much slower than what is currently in pysparnn. This is probably reasonable for testing, but writing customized & efficient code is probably the best way forward. More examples of custom distances in pysparnn can be found here: https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py

import numpy as np
import scipy.sparse
import scipy.spatial.distance

class SlowEuclideanDistance(MatrixMetricSearch):
    def __init__(self, records_features, records_data):
        super(SlowEuclideanDistance, self).__init__(records_features,
                                                    records_data,
                                                    is_similarity=False)
        # cdist needs dense input, so densify the index matrix up front
        self.matrix = self.matrix.toarray()

    @staticmethod
    def _transform_value(v):
        return v

    @staticmethod
    def features_to_matrix(features):
        """
        Args:
            features: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return np.array(features, ndmin=2)

    @staticmethod
    def vstack(matrix_list):
        """
        Args:
            matrix_list: A list of matrices to be stacked.
        Returns:
            The stacked matrix.
        """
        return np.vstack(matrix_list)

    def _distance(self, a_matrix):
        """Euclidean distance"""
        # need to handle flipping argmin k to positive
        return scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
                a_matrix.toarray(),
                self.matrix, 'euclidean'))
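As a quick sanity check on the `cdist` call used above, here is a small self-contained sketch (the toy data is made up): one row per query, one column per index record, wrapped back into a sparse matrix exactly as the class does.

```python
import numpy as np
import scipy.sparse
import scipy.spatial.distance

# Toy stand-ins: 'records' plays the role of the densified self.matrix,
# 'queries' the sparse matrix passed in at search time.
records = np.array([[0.0, 0.0], [3.0, 4.0]])
queries = scipy.sparse.csr_matrix(np.array([[0.0, 0.0], [3.0, 0.0]]))

dist = scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
    queries.toarray(), records, 'euclidean'))

d = dist.toarray()
# d[0] = distances from (0,0): [0.0, 5.0]; d[1] = from (3,0): [3.0, 4.0]
```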

@spencebeecher
Contributor

https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py#L249

Any updates on this? If I don't hear back from you in a week I am going to close this out!

@tejasshah93
Author

tejasshah93 commented Mar 19, 2017

Hi @spencebeecher - Apologies for not getting back earlier. Missed the notification!
Thanks for the prompt and detailed response for the issue raised along with the other insights!

From what I understood, in order to define my own custom distance metric in pysparnn, I have to define a class UserCustomDistance (say), its constructor, and the methods shown in the snippet above / here, viz.
_transform_value, features_to_matrix, vstack, and
_distance, which defines the distance metric to be applied over the matrices.

Having written the above class and its methods, simply passing distance_type=pysparnn.matrix_distance.UserCustomDistance while calculating the nearest neighbors would do the job. Sounds right?

Let me know if I'm not headed in the right direction. Otherwise, you may close the issue for now, and I'll reopen it later for clarifications during implementation, if any. Thanks!

@spencebeecher
Contributor

That is totally it! Please let me know if you run into any issues or have suggestions for improvement! Thanks!

@tejasshah93
Author

Sure thing! 👍

@tejasshah93
Author

tejasshah93 commented Mar 22, 2017

Hi @spencebeecher,

As discussed, I tried implementing a custom distance class and its allied methods, using sklearn.metrics.pairwise.pairwise_distances in _distance() with a callable metric function as an argument to it.

The training dataset is a sparse matrix of size NxN (value of N is around 900,000)
(Machine configuration: 64 GB RAM, 12 cores)

However, it seems that the index creation itself is taking quite long to complete (4 hours and still running). I had a brief look at cluster_index.py but couldn't find a parameter to specify the number of workers/jobs for index creation.

Is this expected behavior when using sklearn.metrics.pairwise.pairwise_distances (as you mentioned earlier)? Or could you let me know the time complexity of indexing with respect to the size of the matrix?

Thanks!

@spencebeecher
Contributor

spencebeecher commented Mar 22, 2017 via email

@tejasshah93
Author

Hi @spencebeecher

Indeed. I ran the code against a small dummy dataset to validate the nearest neighbors and the custom distance values - it seems to be working as expected! This gist contains the custom distance class (a value was modified for testing purposes). After appending this class to matrix_distance.py, I updated the bindings by executing $ python setup.py install --user from the repository folder.

For reproducing the experiment on your end, you can create a random sparse matrix of NxN (N=900000) and run

>>> cp = ci.MultiClusterIndex(dataset, data_to_return, distance_type=pysparnn.matrix_distance.UserCustomDistance)

where data_to_return = range(dataset.shape[0])

Let me know if anything doesn't seem fine or otherwise. Thanks!
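For reference, the random sparse matrix for that reproduction can be built with `scipy.sparse.random`; a sketch at a much smaller N (the MultiClusterIndex call itself requires pysparnn, so it is left as a comment here):

```python
import scipy.sparse

N = 1000  # the experiment above used N = 900000
# density = 10.0 / N targets roughly 10 nonzero elements per record
dataset = scipy.sparse.random(N, N, density=10.0 / N, format='csr')
data_to_return = range(dataset.shape[0])

# cp = ci.MultiClusterIndex(dataset, data_to_return,
#                           distance_type=pysparnn.matrix_distance.UserCustomDistance)
```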

@spencebeecher
Contributor

spencebeecher commented Mar 23, 2017 via email

@tejasshah93
Author

Appreciate your prompt response @spencebeecher!
Appreciate your prompt response @spencebeecher!
For diagnosis, I'd like to let you know that the index creation step still hasn't completed (> 24 hrs and still running). So yes, your suspicion regarding sklearn/scipy distance metric performance is probably right.

Sure, let me know when you're done with the implementation (or notify me if I've erred somewhere while writing the distance class).

Thanks!

@spencebeecher spencebeecher reopened this Mar 24, 2017
@spencebeecher
Contributor

My solution may be faster - I tested the new implementation on sample data and was able to get the same results. I am currently running this on my laptop, which does not have much power. I'll let you know how long it took to index. Note that I did make the matrix rather sparse (only 10 elements per record on average).

@tejasshah93
Author

Hi,

I'm not entirely sure what you're attempting when defining the distance metric here. I guess that's why the results differ for the sample matrices mentioned in the gist. (How did you get the same results?)

In [6]: u_dist = UserCustomDistance(A, range(A.shape[0]))
    ...: u_dist._distance(A)
Out[6]: 
array([[-2.,  1.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  1.,  0.]])

In [7]: su_dist = SpencerUserCustomDistance(A, range(A.shape[0]))
    ...: su_dist._distance(A)
Out[7]: 
array([[-4,  2,  0],
       [-1,  0,  0],
       [-2,  1,  0]])

For clarification, the custom distance value is computed as follows:
user_distance_metric() takes np.minimum over the data of the two sparse vectors being compared, then subtracts the sum of the resulting array from self.max_overlap.
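Taken literally, that description corresponds to something like the following sketch (not the gist's exact code; the constant and toy values here are illustrative):

```python
import numpy as np

MAX_OVERLAP = 3.0  # illustrative constant standing in for self.max_overlap

def user_distance_metric(a, b):
    # Elementwise minimum of the two vectors = their overlap;
    # distance = how far the overlap falls short of MAX_OVERLAP.
    return MAX_OVERLAP - np.minimum(a, b).sum()

a = np.array([1.0, 0.0, 2.0])
b = np.array([1.0, 1.0, 0.0])
d = user_distance_metric(a, b)
# np.minimum(a, b) == [1, 0, 0], overlap == 1, so d == 2.0
```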


However, even when I proceeded to create the index as mentioned in the gist, it aborted with the following error:
. .
/home/tejas/.local/lib/python2.7/site-packages/pysparnn/cluster_index.pyc in __init__(self, features, records_data, distance_type, matrix_size, parent)
    135                 max_rng = min(rng + rng_step, features.shape[0])
    136                 records_rng = features[rng:max_rng]
--> 137                 for i, clstrs in enumerate(root.nearest_search(records_rng, k=1)):
    138                     for _, cluster in clstrs:
    139                         item_to_clusters[cluster].append(i + rng)

/home/tejas/.local/lib/python2.7/site-packages/pysparnn/matrix_distance.pyc in nearest_search(self, features, k, max_distance)
    102         """
    103 
--> 104         dist_matrix = self._distance(features)
    105 
    106         if max_distance is None:

<ipython-input-2-9ac6b3897d5b> in _distance(self, a_matrix)
     69 
     70     def _distance(self, a_matrix):
---> 71         return np.array(self.matrix.sum(axis=1) - self.matrix.dot(a_matrix.transpose()).transpose())
     72 
     73 

/home/tejas/.local/lib/python2.7/site-packages/scipy/sparse/base.pyc in __mul__(self, other)
    350         if issparse(other):
    351             if self.shape[1] != other.shape[0]:
--> 352                 raise ValueError('dimension mismatch')
    353             return self._mul_sparse_matrix(other)
    354 

ValueError: dimension mismatch

Any pointers?

@spencebeecher
Contributor

spencebeecher commented Mar 27, 2017

Whoops - you are totally right! I had a bug that I must have introduced. Try this code. I have only tried it on 100k records, not 1 million.

https://gist.github.com/spencebeecher/04e79631e9725ea25889a53117319412

(same as the gist just with notebook output) tej_notebook.pdf

@tejasshah93
Author

tejasshah93 commented Mar 29, 2017

Hi @spencebeecher ,

I tried the modified code above on 1 million records and it executed in just about 225 seconds! (matrix.sum(axis=1).mean() returned 99.997115170633236). That's really fast by comparison!

As you mentioned, the results are the same for the sample matrices this time. However, when I checked the output for the sample matrices from the previous gist, it gives different results: I'm not sure of the approach used to define the custom distance in SpencerUserCustomDistance. Maybe have a glance at it?

Thanks!

@spencebeecher
Contributor

It was a bug (kinda)!

I had 'max_overlap' specified as a per-record value, matrix.sum(axis=1). This means that the max_distance would vary by record in the index. Setting max_overlap to matrix.shape[0] fixed the issue.
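A toy illustration of the difference (made-up numbers, not from the gist): with a per-record max_overlap the offset varies across index records, which can distort the ranking even when the actual overlaps are identical, whereas a single global constant shifts every record equally:

```python
import numpy as np

records = np.array([[5.0, 5.0], [1.0, 1.0]])  # two index records
query = np.array([1.0, 1.0])

# Both records overlap the query by exactly 2.0
overlap = np.minimum(records, query).sum(axis=1)

# Buggy: per-record max_overlap = records.sum(axis=1) -> offsets [10, 2],
# so record 1 looks much "closer" despite identical overlaps
per_record = records.sum(axis=1) - overlap   # [8.0, 0.0]

# Fixed: one global constant -> both records tie, as they should
global_max = 10.0
fixed = global_max - overlap                 # [8.0, 8.0]
```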

Old and new
