Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improv perf #29

Merged
merged 5 commits into from Jan 14, 2023
Merged

Improv perf #29

merged 5 commits into from Jan 14, 2023

Conversation

nucccc
Copy link
Contributor

@nucccc nucccc commented Jan 5, 2023

Good afternoon,
I hope this message finds you well, and I compliment and thank you for the code.

I worked with frechet and dtw, and I found that computational performances of the frechet function were subpar since they didn't use the cdist function from scipy (which i found by far more performing than the minkowski_distance one).

I took the freedom to change the code and propose you a pull request (i also modified tests code in order for it not to import the already installed package).

On my day to day job I also use cython to improve performances of python programs, and I was thinking that maybe it could benefit some of the loops in the code.

Best of wishes for whatever!

Nuc

@cjekel
Copy link
Owner

cjekel commented Jan 7, 2023

Hi Nuc,

Thanks for this PR! It will definitely increase the performance of discrete frechet distance.

I'm going to recommend you make a couple minor changes, then I'll merge this in and publish a new release out there.

As far as cython goes -- yes I'd expect cython to be even faster. I kind of like the easier to read source code right now... but I can see people needing performance. I think the best path in the future could be an optional cython installation, which could fall back to python.

Thanks,
CJ

@@ -434,7 +434,7 @@ def curve_length_measure(exp_data, num_data):
return np.sqrt(np.sum(r_sq))


def frechet_dist(exp_data, num_data, p=2):
def frechet_dist(exp_data, num_data, metric='euclidean'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the flake8 errors here:

similaritymeasures/similaritymeasures.py:4:1: F401 'scipy.spatial.minkowski_distance' imported but unused
similaritymeasures/similaritymeasures.py:499:50: E251 unexpected spaces around keyword / parameter equals
similaritymeasures/similaritymeasures.py:499:52: E251 unexpected spaces around keyword / parameter equals
similaritymeasures/similaritymeasures.py:502:19: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:504:39: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:506:39: E231 missing whitespace after ','
similaritymeasures/similaritymeasures.py:510:31: E231 missing whitespace after ','

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sincerely I did not run flake on my machine I don't quickly know how to do it, but I think I fixed everything. I would kindly ask you to rerun it and evaluate again if standards are met.

tests/tests.py Outdated
Comment on lines 1 to 4
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, '../similaritymeasures')

Copy link
Owner

@cjekel cjekel Jan 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are not needed when you install the package. Please remove these 4 lines.

When you do this kind of package development, you should install the library from the source code. I normally clone the library, and then install the local files with pip. After making changes to the source files you then re-install them.

A lot of people find this constant installation annoying.

A good alternative is to use pip -e which will make a symbolic link from your site packages folder to the local source code. Essentially then the library's source code becomes editable.

So with all that said, you can avoid setting paths by simply doing

git clone https://github.com/cjekel/similarity_measures.git
python -m pip install -e ./similarity_measures

which will then let you make changes to the source code and always import the latest code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation on how to reinstall everything!

I just removed these lines in my latest commit

@nucccc
Copy link
Contributor Author

nucccc commented Jan 7, 2023

I made my best to handle the changes! Please let me know if I missed something.

@nucccc
Copy link
Contributor Author

nucccc commented Jan 7, 2023

Hi Nuc,

Thanks for this PR! It will definitely increase the performance of discrete frechet distance.

I'm going to recommend you make a couple minor changes, then I'll merge this in and publish a new release out there.

As far as cython goes -- yes I'd expect cython to be even faster. I kind of like the easier to read source code right now... but I can see people needing performance. I think the best path in the future could be an optional cython installation, which could fall back to python.

Thanks, CJ

I'll do my best to answer here regarding cython. I already took the code and cythonized it in a separate folder of mine, did some quick and dirty benchmarking/profiling. The best of the performance gain came just from using distance.cdist function. That gave a gain of like 7x, up to what I recall.

Then, the cythonization I did was quite light: I just used cdef to take the i and j variable (the ones used in the loops) and declared them as c integers. That made gave the code a 30% speed up to what I recall (on the already improved version using cdist...).

That is to provide you with my personal experience. Readability was not that much affected I suppose! It was just two initial declarations. But the cython build process may become messy in the installation procedure I fear.

Maybe a similar improvement can be obtained by just setting i and j as np.int32/np.int16 or something, but I don't know if the range will discard them and replace them with actual python ints (which slow down stuff since they are full fledged pointers to objects instead of just 4-byte integers).

All this stuff up to my limited knowledge.

A horrible thing that I can propose is something that would possibly require modifications to the interface: if you already calculated the distance matrix between each pair of points of the curves, you can think of passing it to frechet and dtw functions, in order to avoid recalculating it twice (in case you need both distances). I suppose sklearn allows something like this in certain clustering algorithms, like "I will calculate the distances matrix to execute the clustering algo, but if you already have it just pass it and that will be used directly". This also allows people to use peculiar distance calculation criteria. But I suggest such an idea just for the sake of it, knowing the importance of keeping the interface stable and standardized!

Thank you for the time and consideration

@cjekel
Copy link
Owner

cjekel commented Jan 9, 2023

I made my best to handle the changes! Please let me know if I missed something.

Code looks good to be merged. Thank you for this contribution!

This is going to trigger a 1.0.0 release of this software since I think I'm going to break backwards compatibility.

@cjekel
Copy link
Owner

cjekel commented Jan 9, 2023

@nucccc can you make an issue on cython dtw and frechet with your points? I don't have time right now to give it much thought, but I will to help out with that!

@nucccc
Copy link
Contributor Author

nucccc commented Jan 9, 2023

In my last commit I found a way to mantain backwards compatibility. It was just necessary to pass the minkowski distance as an argument to distance.cdist and then set p=p in the **kwargs of the scipy function

`def frechet_dist(exp_data, num_data, p=2):

...

c = distance.cdist(exp_data, num_data, metric='minkowski', p=p)`

So I hope backwards compatibility shall not be broken

@nucccc nucccc mentioned this pull request Jan 9, 2023
@cjekel
Copy link
Owner

cjekel commented Jan 14, 2023

In my last commit I found a way to mantain backwards compatibility. It was just necessary to pass the minkowski distance as an argument to distance.cdist and then set p=p in the **kwargs of the scipy function

woah! I didn't know that was a thing. That will make the merge and release a bit easier!

LGTM. Thank you for this contribution!

@cjekel cjekel merged commit aa8227c into cjekel:master Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants