Callable metric function for custom distances #11
Comments
Hi @tejasshah93 - Great question! There isnt anything custom like that out of the box. However it is pretty easy to implement your own custom distance metrics. By default, pysparnn uses sparse matricies + cosine similarity. This is where the library really shines. I am not aware of many python based libraries that do sparse search accurately and efficiently (https://github.com/ekzhu/datasketch and https://github.com/src-d/minhashcuda both look good). However here is an example (already in the code) of using an alternative distance. https://github.com/facebookresearch/pysparnn/blob/master/examples/dense_matrix.ipynb Here is an example (that should mostly work) that will let you use custom distances. scipy.spatial.distance.cdist provides a variety of distance types. Caution - I have found these distance functions to be much slower than what is currently in pysparnn. This is probably reasonable for testing but writing customized & efficient code is probably the best way forward. More examples of custom distances in pysparnn can be found here: https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py
|
https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py#L249 Any updates on this? If I dont hear back from you in a week I am going to close this out! |
Hi @spencebeecher - Apologies for not getting back earlier. Missed the notification! From what I understood, in order to define my own custom distance metric in pysparnn, I have to define a class Having written the above class and it's methods, simply using Let me know if I'm not headed in the right direction. Else, you may close the issue for now. And I'll reopen it for clarifications later while implementation, if any. Thanks! |
That is totally it! Please let me know if you run into any issues or have suggestions for improvement! Thanks! |
Sure thing! 👍 |
Hi @spencebeecher, As discussed, I tried implementing a custom distance class and it's allied methods with using sklearn.metrics.pairwise.pairwise_distances in The training dataset is a sparse matrix of size NxN (value of N is around 900,000) However, it seems that the index creation itself is taking a bit long to complete (4 hours and still running). Had a brief look at cluster_index.py but couldn't find if there's a parameter to specify the number of workers/jobs for index creation. Is this an expected behavior while using Thanks! |
Hi! Could you share a copy of the code? I can try to reproduce on my end.
Are you able to get it to work with a different distance measure?
…On Mar 22, 2017 3:46 PM, "Tejas Shah" ***@***.***> wrote:
Hi @spencebeecher <https://github.com/spencebeecher>,
As discussed, I tried implementing a custom distance class and it's allied
methods with using sklearn.metrics.pairwise.pairwise_distances
<http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html>
in _distance() and a callable metric function as an argument to it.
The training dataset is a sparse matrix of size NxN (value of N is around
900,000)
(Machine configuration: 64 GB RAM, 12 cores)
However, it seems that the index creation itself is taking a bit long to
complete (4 hours and still running). Had a brief look at cluster_index.py
<https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py>
but couldn't find if there's a parameter to specify the number of
workers/jobs for index creation.
Is this an expected behavior while using sklearn.metrics.pairwise.
pairwise_distances (as you probably mentioned earlier)? Or if you could
let me know the time complexity for indexing with respect to the size of
the matrix.
Thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXLXSqJ-XamPM1UDSN-r9bOXiww_aiIks5roaS6gaJpZM4MZQu1>
.
|
Indeed. I ran the code against a small dummy dataset for validating the nearest neighbors and the custom distance values - it seems to be working as expected! This gist represents the custom distance class written (modified a value for testing purpose). After appending this class to matrix_distance.py, updated the bindings by executing For reproducing the experiment on your end, you can create a random sparse matrix of NxN (N=900000) and run
where Let me know if anything doesn't seem fine or otherwise. Thanks! |
Thanks for the detailed reply - I'll have a look tonight and get back to
you. I suspect that the sklearn distance metric may not be as performant as
we would like. I might be able to write something custom.
Thanks again!
…On Mar 22, 2017 6:10 PM, "Tejas Shah" ***@***.***> wrote:
Hi @spencebeecher <https://github.com/spencebeecher>
Indeed. I ran the code against a small dummy dataset for validating the
nearest neighbors and the custom distance values - it seems to be working
as expected! This gist
<https://gist.github.com/tejasshah93/8dc7dcbc7463f36e171d5db66355ff87>
represents the custom distance class written (modified a value for testing
purpose). After appending this class to matrix_distance.py
<https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py>,
updated the bindings by executing $ python setup.py install --user from
repository folder.
For reproducing the experiment on your end, you can create a random sparse
matrix of NxN (N=900000) and run
>>> cp = ci.MultiClusterIndex(dataset, data_to_return, distance_type=pysparnn.matrix_distance.UserCustomDistance)
where data_to_return = range(dataset.shape[0])
Let me know if anything doesn't seem fine or otherwise. Thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAXLXfMfUN8Spd0Otba2sW3Ku8E0Euiyks5rocaRgaJpZM4MZQu1>
.
|
Appreciate your prompt response @spencebeecher! Sure, let me know when you're possibly done with the implementation (or notify me if I've erred somewhere while writing the distance class). Thanks! |
My sln may be faster - i tested the new impl on sample data and was able to get the same results. I am currently running this on my laptop which does not have much power. Ill let you know how log it took to index. Note that i did make the matrix rather sparse (only 10 elements per record on average) |
Hi, I'm not entirely sure what you're attempting when defining the distance metric here. Guess that's why the results are different for the sample matrices mentioned in the gist. (how did you get the same results?)
For clarifications, the custom distance value is computed as follows: However, even when I proceeded with creating the index as mentioned in the gist, it aborted with the following error:
Any headers? |
Whoopse - you are totally right! I had a bug that i must have introduced. Try this code. I haven only tried it on 100k records not 1mil. https://gist.github.com/spencebeecher/04e79631e9725ea25889a53117319412 (same as the gist just with notebook output) tej_notebook.pdf |
Hi @spencebeecher , I tried the above mentioned modified code for 1 million records and it executed in just about 225 seconds! (where As you mentioned, the results are same for sample matrices this time. However, when I checked the output for sample matrices from the previous gist, it's giving different results: I'm not sure of the approach used in defining the custom distance in Thanks! |
Hi,
PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.
E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:
Do we have such a provision for custom distances in PySparNN?
Thanks.
The text was updated successfully, but these errors were encountered: