New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improv perf #29
Improv perf #29
Conversation
Hi Nuc, Thanks for this PR! It will definitely increase the performance of discrete frechet distance. I'm going to recommend you make a couple minor changes, then I'll merge this in and publish a new release out there. As far as cython goes -- yes I'd expect cython to be even faster. I kind of like the easier to read source code right now... but I can see people needing performance. I think the best path in the future could be an optional cython installation, which could fall back to python. Thanks, |
@@ -434,7 +434,7 @@ def curve_length_measure(exp_data, num_data): | |||
return np.sqrt(np.sum(r_sq)) | |||
|
|||
|
|||
def frechet_dist(exp_data, num_data, p=2): | |||
def frechet_dist(exp_data, num_data, metric='euclidean'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix the flake8 errors here:
similaritymeasures/similaritymeasures.py:4:1: F401 'scipy.spatial.minkowski_distance' imported but unused
similaritymeasures/similaritymeasures.py:499:50: E251 unexpected spaces around keyword / parameter equals
similaritymeasures/similaritymeasures.py:499:52: E251 unexpected spaces around keyword / parameter equals
similaritymeasures/similaritymeasures.py:502:19: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:504:39: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:506:39: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:510:31: E231 missing whitespace after ','
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sincerely I did not run flake on my machine I don't quickly know how to do it, but I think I fixed everything. I would kindly ask you to rerun it and evaluate again if standards are met.
tests/tests.py
Outdated
import sys | ||
# caution: path[0] is reserved for script path (or '' in REPL) | ||
sys.path.insert(1, '../similaritymeasures') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are not needed when you install the package. Please remove these 4 lines.
When you do this kind of package development, you should install the library from the source code. I normally clone the library, and then install the local files with pip. After making changes to the source files you then re-install them.
A lot of people find this constant installation annoying.
A good alternative is to use pip -e
which will make a symbolic link from your site packages folder to the local source code. Essentially then the library's source code becomes editable.
So with all that said, you can avoid setting paths by simply doing
git clone https://github.com/cjekel/similarity_measures.git
python -m pip install -e ./similarity_measures
which will then let you make changes to the source code and always import the latest code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation on how to reinstall everything!
I just removed these lines in my latest commit
I made my best to handle the changes! Please let me know if I missed something. |
I'll do my best to answer here regarding cython. I already took the code and cythonized it in a separate folder of mine, did some quick and dirty benchmarking/profiling. The best of the performance gain came just from using distance.cdist function. That gave a gain of like 7x, up to what I recall. Then, the cythonization I did was quite light: I just used cdef to take the i and j variable (the ones used in the loops) and declared them as c integers. That made gave the code a 30% speed up to what I recall (on the already improved version using cdist...). That is to provide you with my personal experience. Readability was not that much affected I suppose! It was just two initial declarations. But the cython build process may become messy in the installation procedure I fear. Maybe a similar improvement can be obtained by just setting i and j as np.int32/np.int16 or something, but I don't know if the range will discard them and replace them with actual python ints (which slow down stuff since they are full fledged pointers to objects instead of just 4-byte integers). All this stuff up to my limited knowledge. A horrible thing that I can propose is something that would possibly require modifications to the interface: if you already calculated the distance matrix between each pair of points of the curves, you can think of passing it to frechet and dtw functions, in order to avoid recalculating it twice (in case you need both distances). I suppose sklearn allows something like this in certain clustering algorithms, like "I will calculate the distances matrix to execute the clustering algo, but if you already have it just pass it and that will be used directly". This also allows people to use peculiar distance calculation criteria. But I suggest such an idea just for the sake of it, knowing the importance of keeping the interface stable and standardized! Thank you for the time and consideration |
Code looks good to be merged. Thank you for this contribution! This is going to trigger a 1.0.0 release of this software since I think I'm going to break backwards compatibility. |
@nucccc can you make an issue on cython dtw and frechet with your points? I don't have time right now to give it much thought, but I will to help out with that! |
In my last commit I found a way to mantain backwards compatibility. It was just necessary to pass the minkowski distance as an argument to distance.cdist and then set p=p in the **kwargs of the scipy function `def frechet_dist(exp_data, num_data, p=2): ... c = distance.cdist(exp_data, num_data, metric='minkowski', p=p)` So I hope backwards compatibility shall not be broken |
woah! I didn't know that was a thing. That will make the merge and release a bit easier! LGTM. Thank you for this contribution! |
Good afternoon,
I hope this message finds you well, and I compliment and thank you for the code.
I worked with frechet and dtw, and I found that computational performances of the frechet function were subpar since they didn't use the cdist function from scipy (which i found by far more performing than the minkowski_distance one).
I took the freedom to change the code and propose you a pull request (i also modified tests code in order for it not to import the already installed package).
On my day to day job I also use cython to improve performances of python programs, and I was thinking that maybe it could benefit some of the loops in the code.
Best of wishes for whatever!
Nuc