[PY] Compute sparse matrix when input is a point-cloud with a threshold applied #54
If the input is a point cloud and a threshold is provided, we use a kd_tree structure to compute the distance matrix, which is faster and more memory efficient. Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
…matrix. For the moment, only the euclidean and minkowski metrics are supported. Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
I think this PR is a good opportunity to also improve on a couple of highly related things:
- Docstrings: there are a few rules about how the input `X` is processed that are explicitly stated in `giotto-tda` and should also be explicitly stated here because they are (I think) the same. Here we should be able to import the content of https://github.com/giotto-ai/giotto-tda/blob/a0f4cd4a5cf179b25c9ab7e5aee5ca889b4fc183/gtda/homology/simplicial.py#L204-L220 into the description of the `X` parameter.
- Use of `_resolve_symmetry_conflicts`: the only use case for this function as it is written now is when `metric='precomputed'` and the input is sparse. In the current code, it is executed in giotto-ph/gph/python/ripser_interface.py, line 501 in 2b3d676, `row, col, data = _resolve_symmetry_conflicts(coo_matrix(dm))` `row`, `col` and `data` values. In the same way, giotto-ph/gph/python/ripser_interface.py, line 456 in 2b3d676, `row, col, data = _resolve_symmetry_conflicts(dm.tocoo())  # Upper diag` `_pc_to_dm_with_threshold` is applied. But `_pc_to_dm_with_threshold` returns a symmetric matrix, so again `_resolve_symmetry_conflicts` will do useless work in this case. A possible solution is to add a kwarg `check` (default `True`) to `_resolve_symmetry_conflicts`: when `False`, just the upper diagonal information is returned.
I suggest this name change.
From _pc_to_dm_with_threshold to _pc_to_sparse_dm_with_threshold Co-authored-by: Umberto Lupo <46537483+ulupo@users.noreply.github.com>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
If I understood you right, you would like to replace the current docstring for parameter X with the following:
Is that right?
Almost. But some things need to be changed because I pulled that out of |
I am not sure it is true, if I follow correctly the code, we call 2 times
The first one is computed in all cases if the input is sparse. I think that here you are right, we should not compute this if we have a sparse input because we called For the second case that
So instead of calling |
Sure, no stress, feel free to directly push your suggestion into the PR :)
This remark is about the fact that giotto-ph/gph/python/ripser_interface.py Line 501 in 2b3d676
I don't think the best approach is
I think the best approach is
because |
@MonkeyBreaker in view of #55, I think we can think about fixing that issue here, what do you think?
Sure, it makes sense because we are reworking the processing of the input :) !
I understand now, I think the confusion for me is that
See comment below |
I still prefer a solution with a kwarg, as follows or similar:

    import numpy as np


    def _resolve_symmetry_conflicts(coo, check=True):
        """Given a sparse matrix in COO format, filter out any entry at location
        (i, j) strictly below the diagonal if the entry at (j, i) is also
        stored. Return row, column and data information for an upper diagonal
        COO matrix."""
        _row, _col, _data = coo.row, coo.col, coo.data
        in_upper_triangle = _row <= _col
        if check:
            # Check if there is anything below the main diagonal
            if not in_upper_triangle.all():
                # Initialize filtered COO data with information in the upper
                # triangle
                row = _row[in_upper_triangle]
                col = _col[in_upper_triangle]
                data = _data[in_upper_triangle]
                # Filter out entries below the diagonal for which entries at
                # transposed positions are already available
                below_diag = np.logical_not(in_upper_triangle)
                upper_triangle_indices = set(zip(row, col))
                additions = tuple(
                    zip(*((j, i, x) for (i, j, x) in zip(_row[below_diag],
                                                         _col[below_diag],
                                                         _data[below_diag])
                          if (j, i) not in upper_triangle_indices))
                )
                # Add surviving entries below the diagonal to final COO data
                if additions:
                    row_add, col_add, data_add = additions
                    row = np.concatenate([row, row_add])
                    col = np.concatenate([col, col_add])
                    data = np.concatenate([data, data_add])
                return row, col, data
            else:
                # Nothing below the diagonal, so there is nothing to resolve
                return _row, _col, _data
        else:
            # check=False: skip conflict resolution, keep the upper triangle only
            row, col = _row[in_upper_triangle], _col[in_upper_triangle]
            data = _data[in_upper_triangle]
            return row, col, data

Having to make a decision about which function to use is in my view more overhead. I'd rather change the name of this function than add another one which still deals with issues of symmetry. Maybe something like
I understand but then we need to have a function name that conveys the purpose.
I was more thinking something like |
I'm guessing you are saying that if we pass to |
No I think that for In an ideal case, yes, we could do both steps separately. But specifically for My previous remark won't apply to |
To reply to
I don't think this helps me think about things clearly. The main purpose is to "sanitize" the sparse input. Now, "sanitizing" means:
The rest is just performance considerations. If we already know that there's stuff above and below (case 2) but no possibility of conflicts, we should shortcut the logic for performance reasons. This is the purpose of the extra |
If there's nothing below the diagonal, currently |
XD, I think we won't agree. But I think it is subjective, if I summarize:
In the end as you wrote:
Then, maybe something more like `def _sanetize_sparse_input(X, only_extract_upper)`. Something like that.
Right, I did not even think about this case. In this case, yes we should add as you suggested a |
I think that would be too hard. It's enough to make the readers aware in my opinion.
Shall I proceed with these changes? I'll ask for a review once I'm done.
With pleasure!
I pushed an approach based on I believe this shows that I realize that some of this contradicts my previous opinion that we should always avoid the dense computation by default if too much RAM is used, but at least with this solution we gain some tight integration with |
Documentation has all details about when each algorithm is selected. I think as you that we should stick with
I think that with the documentation you wrote in the docstring it is good enough. I don't think we can have a perfect solution, but right now I think that we are good! Let me try to benchmark the current defaults to see how much better we are compared to the original.
I missed this in #50 but `max_coeff_supported` should probably be elevated to a global variable defined somewhere at the top of the file. @MonkeyBreaker I can make this small change in this PR if OK with you?
Sure. If you make it a global variable, could you uppercase the entire name? I think that is the convention in Python.
Yup, I was going to do it like that.
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
@ulupo, are there any important points left to resolve that I can help with in order to merge this new improvement 😄? Btw, 2 things I verified when using Torus, threshold 0.15 and maxdim = 2:
|
@MonkeyBreaker thanks for the profiling! So are you formally approving this for a merge? I can't think of any more changes that need to be made.
On my side, if you address my comment on
@MonkeyBreaker I'm sorry, I don't understand what comment this is.
gph/python/ripser_interface.py (outdated)

    metric_params=metric_params,
    n_jobs=n_threads,
    **nearest_neighbors_params).fit(X)
    # Upper triangular CSR output
I think that when no `format` is defined for `triu`, it returns a COO format. See code.
Ah I see! Well, COO is actually better for what we want to do later, so I can just fix the docs. What do you think?
Sure, fixing the docs seems like a good idea (I am not sure whether we should also explicitly specify `format='coo'`)!
I also think that we should keep using COO whenever we can.
I like being explicit so I passed `format="coo"`.
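For readers following this thread, here is a small sketch of the `triu` behaviour being discussed. It is an illustration only, not code from the PR: the 3x3 matrix is made up, and it assumes `scipy.sparse.triu` with its `k` and `format` arguments.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

# Toy symmetric distance matrix stored as CSR (illustrative values only)
dm = csr_matrix(np.array([[0.0, 1.0, 2.0],
                          [1.0, 0.0, 3.0],
                          [2.0, 3.0, 0.0]]))

# Without an explicit format, triu hands back a COO result even for CSR input
upper_default = triu(dm, k=1)
# Passing format="coo" makes the intent explicit
upper_coo = triu(dm, k=1, format="coo")

print(upper_default.format, upper_coo.format)
```

With a COO result, the `row`, `col` and `data` arrays can be read directly as attributes without any further conversion.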
Perfect, I'll resolve this and do a last review.
Sorry, again, I forgot to publish them ...
LGTM! I generated the documentation on my machine, it looks great 👍
This PR adds an optimization for computing the distance matrix from a point cloud directly in Python.
We observed that the computation of a distance matrix for a big point cloud (50k points) is quite expensive.
The upstream C++ `ripser`, when passed a point cloud with a threshold, directly computes a sparse distance matrix with all distances above the threshold removed. We reproduce this behavior directly in Python by using a method from the `scipy` library.
- Support for `euclidean` metric
- Support for `minkowski` metric
- Support to any other metric? `sklearn` NearestNeighbors instead of `scipy` approach.
- Test using `euclidean` metric
- Test using `minkowski` metric
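The idea in the description can be sketched as follows. This is a minimal illustration and not the code merged in this PR: it assumes `scipy.spatial.cKDTree.sparse_distance_matrix` as the SciPy method, the helper name `sparse_dm_from_point_cloud` is hypothetical, and the random point cloud and threshold are made up. Only Minkowski-type metrics are covered (`p=2` is Euclidean).

```python
import numpy as np
from scipy.spatial import cKDTree


def sparse_dm_from_point_cloud(X, thresh, p=2.0):
    """Hypothetical helper: return row, col, data for the strictly upper
    triangular part of the pairwise distance matrix of X, keeping only
    distances that do not exceed thresh."""
    tree = cKDTree(X)
    # Pairs farther apart than `thresh` are never stored
    dm = tree.sparse_distance_matrix(tree, max_distance=thresh, p=p,
                                     output_type="coo_matrix")
    # Keep each pair once (i < j); this also drops any diagonal entries
    keep = dm.row < dm.col
    return dm.row[keep], dm.col[keep], dm.data[keep]


rng = np.random.default_rng(0)
X = rng.random((200, 3))  # toy point cloud, 200 points in 3D
row, col, data = sparse_dm_from_point_cloud(X, thresh=0.2)
print(f"{len(data)} stored distances out of {200 * 199 // 2} pairs")
```

The number of stored entries depends on the threshold rather than on the square of the number of points, which is where the memory saving for large point clouds comes from.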