
[PY] Compute sparse matrix when input is a point-cloud with a threshold applied #54

Merged
merged 26 commits into from
Dec 7, 2021

Conversation

MonkeyBreaker
Collaborator

@MonkeyBreaker MonkeyBreaker commented Nov 21, 2021

This PR adds an optimization for computing the distance matrix from a point cloud directly in Python.

We observed that computing the distance matrix for a large point cloud (50k points) is quite expensive.
The upstream C++ ripser, when passed a point cloud with a threshold, directly computes a sparse distance matrix with all distances above the threshold removed.

We reproduce this behavior directly in Python by using a method from the scipy library.

  • Implement the method in Python
    • Support for euclidean metric
    • Support for minkowski metric
    • Support for any other metric?
    • Use sklearn NearestNeighbors instead of scipy approach.
  • Add a test that compares two runs on the same dataset, one using a threshold (the enclosing radius) and the other without a threshold.
    • Test using euclidean metric
    • Test using minkowski metric
  • Document this behavior somewhere?
    • Rework the documentation of X as suggested by @ulupo
  • Sanitize sparse input distance matrix
  • Fix Incorrect handling of zeros for dense input when a threshold is passed #55
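The scipy-based approach sketched in this checklist might look roughly as follows (a minimal illustration only, not the PR's actual code; the point cloud, threshold and variable names are invented for the example):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy point cloud and threshold (illustrative values only)
rng = np.random.default_rng(0)
points = rng.random((200, 3))
threshold = 0.25

# Build a k-d tree and query it against itself: only pairs at distance
# <= threshold are computed and stored, giving a sparse matrix directly
# instead of the full dense (200 x 200) distance matrix
tree = cKDTree(points)
dm = tree.sparse_distance_matrix(tree, max_distance=threshold,
                                 p=2,  # p=2 is the euclidean metric
                                 output_type="coo_matrix")
```

Note that `cKDTree.sparse_distance_matrix` only supports Minkowski p-norms, which matches the limitation noted above (euclidean and minkowski only) and hints at why sklearn's NearestNeighbors was later preferred.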

julian added 3 commits November 18, 2021 15:46
If the input is a point cloud and a threshold is provided,
we will use a kd_tree structure to compute the distance matrix,
making the computation faster and more memory-efficient.

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
…matrix

For the moment, add only support to euclidean and minkowski metrics.

Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
@MonkeyBreaker MonkeyBreaker added the enhancement New feature or request label Nov 21, 2021
@MonkeyBreaker MonkeyBreaker self-assigned this Nov 21, 2021
@MonkeyBreaker MonkeyBreaker added this to In progress in v0.3.0 via automation Nov 21, 2021
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Collaborator

@ulupo ulupo left a comment


I think this PR is a good opportunity to also improve on a couple of highly related things:

  1. Docstrings: there are a few rules about how the input X is processed that are explicitly stated in giotto-tda and should also be explicitly stated here because they are (I think) the same. Here we should be able to import the content of https://github.com/giotto-ai/giotto-tda/blob/a0f4cd4a5cf179b25c9ab7e5aee5ca889b4fc183/gtda/homology/simplicial.py#L204-L220 into the description of the X parameter.
  2. Use of _resolve_symmetry_conflicts: the only use case for this function as is written now is when metric='precomputed' and the input is sparse. In the current code, it is executed in
    row, col, data = _resolve_symmetry_conflicts(coo_matrix(dm))
    unnecessarily because the rule for dense matrices is that the lower diagonal is ignored completely, so we should just fetch the upper diagonal row, col and data values. In the same way,
    row, col, data = _resolve_symmetry_conflicts(dm.tocoo()) # Upper diag
    now gets run even if _pc_to_dm_with_threshold is applied. But _pc_to_dm_with_threshold returns a symmetric matrix, so again _resolve_symmetry_conflicts will do useless work in this case. A possible solution is to have an additional kwarg check (default True) to _resolve_symmetry_conflicts: when False, just the upper diagonal information is returned.

Collaborator

@ulupo ulupo left a comment


I suggest this name change.

gph/python/ripser_interface.py
Julián and others added 3 commits November 22, 2021 08:34
From _pc_to_dm_with_threshold to _pc_to_sparse_dm_with_threshold

Co-authored-by: Umberto Lupo <46537483+ulupo@users.noreply.github.com>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
Signed-off-by: julian <julian.burellaperez@heig-vd.ch>
@MonkeyBreaker
Collaborator Author

Docstrings: there are a few rules about how the input X is processed that are explicitly stated in giotto-tda and should also be explicitly stated here because they are (I think) the same. Here we should be able to import the content of https://github.com/giotto-ai/giotto-tda/blob/a0f4cd4a5cf179b25c9ab7e5aee5ca889b4fc183/gtda/homology/simplicial.py#L204-L220 into the description of the X parameter.

If I understood you right, you would like to replace the current docstring for parameter X with the following:

            X : ndarray or list of length n_samples
            Input data representing a collection of point clouds if `metric`
            was not set to ``"precomputed"``, and of distance matrices or
            adjacency matrices of weighted undirected graphs otherwise. Can be
            either a 3D ndarray whose zeroth dimension has size ``n_samples``,
            or a list containing ``n_samples`` 2D ndarrays/sparse matrices.
            Point cloud arrays have shape ``(n_points, n_dimensions)``, and if
            `X` is a list these shapes can vary between point clouds. If
            `metric` was set to ``"precomputed"``, then:
                - Diagonal entries indicate vertex weights, i.e. the filtration
                  parameters at which vertices appear.
                - If entries of `X` are dense, only their upper diagonal
                  portions (including the diagonal) are considered.
                - If entries of `X` are sparse, they do not need to be upper
                  diagonal or symmetric. If only one of entry (i, j) and (j, i)
                  is stored, its value is taken as the weight of the undirected
                  edge {i, j}. If both are stored, the value in the upper
                  diagonal is taken. Off-diagonal entries which are not
                  explicitly stored are treated as infinite, indicating absent
                  edges.
                - Entries of `X` should be compatible with a filtration, i.e.
                  the value at index (i, j) should be no smaller than the
                  values at diagonal indices (i, i) and (j, j).

Is that right?

@ulupo
Collaborator

ulupo commented Nov 22, 2021

If I understood you right, you would like to replace the current docstring for parameter X to the following:

Almost. But some things need to be changed because I pulled that out of VietorisRipsPersistence, which processes collections of distance matrices/point clouds. I'll adapt that text ASAP, please stand by :).

@MonkeyBreaker
Collaborator Author

Use of _resolve_symmetry_conflicts: the only use case for this function as is written now is when metric='precomputed' and the input is sparse. In the current code, it is executed in
giotto-ph/gph/python/ripser_interface.py Line 501 in 2b3d676
row, col, data = _resolve_symmetry_conflicts(coo_matrix(dm))
unnecessarily because the rule for dense matrices is that the lower diagonal is ignored completely, so we should just fetch the upper diagonal row, col and data values. In the same way,
giotto-ph/gph/python/ripser_interface.py Line 456 in 2b3d676
row, col, data = _resolve_symmetry_conflicts(dm.tocoo()) # Upper diag
now gets run even if _pc_to_dm_with_threshold is applied. But _pc_to_dm_with_threshold returns a symmetric matrix, so again _resolve_symmetry_conflicts will do useless work in this case. A possible solution is to have an additional kwarg check (default True) to _resolve_symmetry_conflicts: when False, just the upper diagonal information is returned.

I am not sure that is true. If I follow the code correctly, we call _resolve_symmetry_conflicts twice:

  1. row, col, data = _resolve_symmetry_conflicts(dm.tocoo()) # Upper diag
  2. row, col, data = _resolve_symmetry_conflicts(coo_matrix(dm))

The first one is computed in all cases if the input is sparse. I think you are right here: we should not compute this if we have a sparse input because we already called _pc_to_sparse_dm_with_threshold. But as you point out, I think that _pc_to_sparse_dm_with_threshold should only return the upper diagonal part; if you can confirm this, then I can proceed with the necessary changes inside the function itself.

For the second case where _resolve_symmetry_conflicts is called, it will only be when the input is a distance matrix (metric='precomputed') and a threshold is provided. But in your message I am not sure I understand the following remark:

In the current code, it is executed in ... unnecessarily because the rule for dense matrices is that the lower diagonal is ignored completely, so we should just fetch the upper diagonal row, col and data values.

So instead of calling _resolve_symmetry_conflicts, we should only remove all entries below the diagonal, right?

@MonkeyBreaker
Collaborator Author

Almost. But some things need to be changed because I pulled that out of VietorisRipsPersistence, which processes collections of distance matrices/point clouds. I'll adapt that text ASAP, please stand by :).

Sure, no stress, feel free to push your suggestion directly into the PR :)

@ulupo
Collaborator

ulupo commented Nov 22, 2021

But on your message I am not sure I understand the following remark

This remark is about the fact that

row, col, data = _resolve_symmetry_conflicts(coo_matrix(dm))
is only called when the user passed dense input. And here the "resolution of symmetry conflicts" is trivial: we just throw away the lower diagonal.
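For dense input, the trivial "resolution" described here amounts to something like the following (a sketch with a made-up matrix, not the actual code):

```python
import numpy as np

# Toy dense distance matrix; by convention the lower triangle is ignored
dm = np.array([[0., 1., 2.],
               [9., 0., 3.],
               [9., 9., 0.]])

# Just fetch the upper diagonal row, col and data values directly,
# with no conflict resolution needed
row, col = np.triu_indices_from(dm)
data = dm[row, col]
print(data)  # [0. 1. 2. 0. 3. 0.]
```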

I don't think the best approach is

_pc_to_sparse_dm_with_threshold should only return the upper diagonal part

I think the best approach is

possible solution is to have an additional kwarg check (default True) to _resolve_symmetry_conflicts: when False, just the upper diagonal information is returned.

because _resolve_symmetry_conflicts is already about making decisions about lower vs upper diagonals, so it makes sense to me that it should take full responsibility of it in all cases, if possible.

@ulupo
Collaborator

ulupo commented Nov 22, 2021

@MonkeyBreaker in view of #55, I think we can think about fixing that issue here, what do you think?

@MonkeyBreaker
Collaborator Author

MonkeyBreaker commented Nov 22, 2021

@MonkeyBreaker in view of #55, I think we can think about fixing that issue here, what do you think?

Sure, it makes sense because we are reworking the processing of the input :)!

because _resolve_symmetry_conflicts is already about making decisions about lower vs upper diagonals, so it makes sense to me that it should take full responsibility of it in all cases, if possible.

I understand now. I think the source of my confusion is that _resolve_symmetry_conflicts does two jobs: resolving symmetry issues and returning only the upper diagonal part. Why not instead create another function that, from a sparse input, returns only the upper diagonal part?

This would allow us to call something like _extract_upper_diagonal() when we only need that step, and to call _extract_upper_diagonal() from _resolve_symmetry_conflicts directly. This splits the job between functions, and we won't need to add a kwarg parameter.
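The helper proposed here might look something like this (a sketch only; the name _extract_upper_diagonal and its exact signature come from this discussion, not from the PR's code):

```python
import numpy as np
from scipy.sparse import coo_matrix

def _extract_upper_diagonal(coo):
    """Return (row, col, data) for the entries of a COO matrix that lie
    on or above the main diagonal, discarding everything below it."""
    mask = coo.row <= coo.col
    return coo.row[mask], coo.col[mask], coo.data[mask]

# Example: entry (2, 0) lies below the diagonal and is dropped
m = coo_matrix(([1., 2., 3.], ([0, 1, 2], [1, 2, 0])), shape=(3, 3))
row, col, data = _extract_upper_diagonal(m)
print(list(data))  # [1.0, 2.0]
```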

See comment below

@ulupo
Collaborator

ulupo commented Nov 22, 2021

I still prefer a solution with a kwarg, as follows or similar:

def _resolve_symmetry_conflicts(coo, check=True):
    """Given a sparse matrix in COO format, filter out any entry at location
    (i, j) strictly below the diagonal if the entry at (j, i) is also
    stored. Return row, column and data information for an upper diagonal
    COO matrix."""
    _row, _col, _data = coo.row, coo.col, coo.data
    in_upper_triangle = _row <= _col
    if check:
        # Shortcut: if there is nothing below the main diagonal, the input
        # is already upper diagonal and can be returned unchanged
        if in_upper_triangle.all():
            return _row, _col, _data
        # Initialize filtered COO data with information in the upper triangle
        row = _row[in_upper_triangle]
        col = _col[in_upper_triangle]
        data = _data[in_upper_triangle]
        # Filter out entries below the diagonal for which entries at
        # transposed positions are already available
        below_diag = np.logical_not(in_upper_triangle)
        upper_triangle_indices = set(zip(row, col))
        additions = tuple(
            zip(*((j, i, x) for (i, j, x) in zip(_row[below_diag],
                                                 _col[below_diag],
                                                 _data[below_diag])
                  if (j, i) not in upper_triangle_indices))
            )
        # Add surviving entries below the diagonal to final COO data
        if additions:
            row_add, col_add, data_add = additions
            row = np.concatenate([row, row_add])
            col = np.concatenate([col, col_add])
            data = np.concatenate([data, data_add])
        return row, col, data
    else:
        row, col = _row[in_upper_triangle], _col[in_upper_triangle]
        data = _data[in_upper_triangle]
        return row, col, data

Having to make a decision about which function to use is in my view more overhead. I'd rather change the name of this function than add another one which still deals with issues of symmetry. Maybe something like _resolve_symmetry (so we don't suggest too strongly that there should be conflicts)?

@MonkeyBreaker
Collaborator Author

Having to make a decision about which function to use is in my view more overhead.

I understand, but then we need a function name that conveys the purpose.

Maybe something like _resolve_symmetry (so we don't suggest too strongly that there should be conflicts)?

I was thinking more of something like _extract_upper(X, resolve_symmetry_conflicts=False). Then the primary purpose would be to extract the upper part, and, additionally, it would be able to resolve symmetry conflicts. What do you think?

@ulupo
Collaborator

ulupo commented Nov 22, 2021

This allows to call something like _extract_upper_diagonal() when we only need to do this step and call _extract_upper_diagonal() from _resolve_symmetry_conflicts directly. This allows to split the job done by function. And we won't need to add a kwarg parameter.

I'm guessing you are saying that if we pass _resolve_symmetry_conflicts something already processed by _extract_upper_diagonal, then it will just trivially return the initial COO data, and we can exploit this?

@MonkeyBreaker
Collaborator Author

I'm guessing you are saying that if we pass to _resolve_symmetry_conflicts something already processed by _extract_upper_diagonal then we it will just trivially return the initial COO data, and we can exploit this?

No, I think that _resolve_symmetry_conflicts, after looking at the computation, needs to know the values below the diagonal.

In an ideal case, yes, we could do both steps separately. But _resolve_symmetry_conflicts specifically needs to know the values below the diagonal, so we cannot pass it an upper-diagonal matrix only.

My previous remark doesn't apply to _resolve_symmetry_conflicts; I will cross out that part to make things clearer.

@ulupo
Collaborator

ulupo commented Nov 22, 2021

To reply to

I was more thinking something like _extract_upper(X, resolve_symmetry_conflicts=False). Then the primary purpose would be to extract the upper part. And Additionally it would be able to resolve symmetry conflicts. What do you think ?

I don't think this helps me think about things clearly. The main purpose is to "sanitize" the sparse input. Now, "sanitizing" means:

  1. extracting the upper diagonal if that's all there is
  2. "resolving conflicts" if there's stuff both above and below

The rest is just performance considerations. If we already know that there's stuff above and below (case 2) but no possibility of conflicts, we should shortcut the logic for performance reasons. This is the purpose of the extra kwarg. I agree that this can be achieved by passing _resolve_symmetry_conflicts something that has nothing below the diagonal, but this does not feel like it conveys the purpose as clearly to me.

@ulupo
Collaborator

ulupo commented Nov 22, 2021

No I think that for _resolve_symmetry_conflicts, after looking at the computation, it needs to know the values below the diagonal.

If there's nothing below the diagonal, currently _resolve_symmetry_conflicts just returns the input unchanged.

@MonkeyBreaker
Collaborator Author

The rest is just performance considerations. If we already know that there's stuff above and below (case 2) but no possibility of conflicts, we should shortcut the logic for performance reasons. This is the purpose of the extra kwarg. I agree that this can be achieved by passing _resolve_symmetry_conflicts something that has nothing below the diagonal, but this does not feel like it conveys the purpose as clearly to me.

XD, I think we won't agree, but it is subjective. If I summarize:

  • You prefer something along _resolve_symmetry_conflicts(X, only_ret_upper_diag)
  • I prefer something along _extract_upper(X, _resolve_symmetry_conflicts)

In the end as you wrote:

The main purpose is to "sanitize" the sparse input. Now, "sanitizing" means

Then, maybe something more like:

def _sanitize_sparse_input(X, only_extract_upper)

Something like that.
I prefer that you choose here, because this is code you wrote originally.

@MonkeyBreaker
Collaborator Author

Well, I'm not sure. If my data is 50k points and it's in 1000 dimensions, KDTree might be incredibly slow as discussed in https://en.wikipedia.org/wiki/K-d_tree#Degradation_in_performance_with_high-dimensional_data. So if I know what I am doing, I might prefer to use 20GB willingly.

Right, I did not even think about this case. In that case, yes, we should add a kwarg as you suggested. I like the name use_kdtree; we should explain in the docstring the reason for this argument. Do you think we could provide a rule of thumb, or should we say that only advanced users need be concerned with this argument?

@ulupo
Collaborator

ulupo commented Nov 24, 2021

Do you think we could provide a rule of thumb, or should we say that only advanced users need be concerned with this argument?

I think that would be too hard. It's enough to make the readers aware in my opinion.

@ulupo
Collaborator

ulupo commented Nov 24, 2021

Right, I did not even think about this case. In that case, yes, we should add a kwarg as you suggested. I like the name use_kdtree; we should explain in the docstring the reason for this argument

Shall I proceed with these changes? I'll ask for a review once I'm done.

@MonkeyBreaker
Collaborator Author

Shall I proceed with these changes? I'll ask for a review once I'm done.

With pleasure!

@ulupo
Collaborator

ulupo commented Nov 25, 2021

I pushed an approach based on scikit-learn's NearestNeighbors class, which should offer the greatest amount of flexibility. @MonkeyBreaker, the logic for the automatic choice of algorithm is in these lines:
https://github.com/scikit-learn/scikit-learn/blob/0d378913be6d7e485b792ea36e9268be31ed52d0/sklearn/neighbors/_base.py#L522-L543.

I believe this shows that "brute" (computing the full distance matrix) is always chosen if the number of dimensions is larger than 15, and that KDTree is preferred otherwise. For the torus (4 < 15 dimensions), BallTree is faster, but I think we should stick with the sklearn defaults, what do you think?

I realize that some of this contradicts my previous opinion that we should always avoid the dense computation by default if too much RAM is used, but at least with this solution we gain some tight integration with sklearn, full coverage of all metrics, and some degree of reasonable automation at least for low dimensions. We could always add a custom warning if both shape[0] is large and shape[1] is large, what do you think @MonkeyBreaker ?
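In outline, the NearestNeighbors-based computation could look like this (a hedged sketch of the idea, not the PR's exact implementation; the data and threshold are invented):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 3))
threshold = 0.3

# With algorithm="auto" (the default), sklearn picks "brute", "kd_tree"
# or "ball_tree" based on the metric and the dimensionality of the data
nn = NearestNeighbors(radius=threshold, metric="euclidean").fit(X)

# Sparse CSR matrix holding only the pairwise distances <= threshold;
# querying with X=None excludes each point's zero distance to itself
dm = nn.radius_neighbors_graph(mode="distance")
```

Because NearestNeighbors accepts any sklearn-supported metric, this covers more than the euclidean/minkowski cases of the earlier scipy approach.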

@MonkeyBreaker
Collaborator Author

I believe this shows that "brute" (compute the full distance matrix) is always chosen if the number of dimensions is larger than 15, and that kDTree is preferred otherwise. For torus (4 < 15 dimensions), BallTree is faster, but I think we should stick with the sklearn defaults, what do you think?

The documentation has all the details about when each algorithm is selected.

I agree that we should stick with the sklearn defaults. It will be much easier for us to maintain.

I realize that some of this contradicts my previous opinion that we should always avoid the dense computation by default if too much RAM is used, but at least with this solution we gain some tight integration with sklearn, full coverage of all metrics, and some degree of reasonable automation at least for low dimensions. We could always add a custom warning if both shape[0] is large and shape[1] is large, what do you think @MonkeyBreaker ?

I think that with the documentation you added in the docstring it is good enough. I don't think we can have a perfect solution, but right now I think we are good!

Let me try to benchmark the current defaults to compare how much better we are than originally.

@ulupo ulupo marked this pull request as ready for review December 6, 2021 07:58
Collaborator

@ulupo ulupo left a comment


I missed this in #50, but max_coeff_supported should probably be elevated to a global variable defined somewhere at the top of the file. @MonkeyBreaker, I can make this small change in this PR if that's OK with you?

@MonkeyBreaker
Collaborator Author

I missed this in #50, but max_coeff_supported should probably be elevated to a global variable defined somewhere at the top of the file. @MonkeyBreaker, I can make this small change in this PR if that's OK with you?

Sure, if you put it as a global variable, could you uppercase the entire variable name? I think that is the convention in Python.

@ulupo
Collaborator

ulupo commented Dec 6, 2021

could you uppercase the entire variable name

Yup, I was going to do it like that.

@MonkeyBreaker
Collaborator Author

@ulupo, are there any remaining points I can help resolve in order to merge this new improvement 😄?

Btw, here is what I verified when using the torus, threshold 0.15 and maxdim = 2:

  • Memory peak now: 0.4GB
  • Memory peak with scipy: 0.7GB
  • Memory peak originally: 12GB

@ulupo
Collaborator

ulupo commented Dec 7, 2021

@MonkeyBreaker thanks for the profiling! So are you formally approving this for a merge? I can't think of any more changes that need to be made.

@MonkeyBreaker
Collaborator Author

On my side, once you address my comment about triu (which I think is not right), it all looks good to me!

@ulupo
Collaborator

ulupo commented Dec 7, 2021

On my side, once you address my comment about triu (which I think is not right).

@MonkeyBreaker I'm sorry, I don't understand what comment this is.

metric_params=metric_params,
n_jobs=n_threads,
**nearest_neighbors_params).fit(X)
# Upper triangular CSR output
Collaborator Author


I think that when no format is specified, triu returns COO format.
See the code.

Collaborator

@ulupo ulupo Dec 7, 2021


Ah I see! Well, COO is actually better for what we want to do later, so I can just fix the docs. What do you think?

Collaborator Author


Sure, fixing the docs seems like the right idea (I am not sure if we should also explicitly specify format='coo')!
I also think that we should keep using COO whenever we can.

Collaborator


I like being explicit, so I passed format="coo".
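The behavior under discussion can be checked in a couple of lines (a small illustration with a made-up matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix, triu, isspmatrix_coo

m = csr_matrix(np.arange(1, 10).reshape(3, 3))

# With no format given, triu returns COO regardless of the input format...
u_default = triu(m)
print(isspmatrix_coo(u_default))  # True

# ...so passing format="coo" is redundant but makes the intent explicit
u_explicit = triu(m, format="coo")
print(isspmatrix_coo(u_explicit))  # True
```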

Collaborator Author


Perfect, I'll resolve this and do a last review.

gph/python/test/test_ripser.py
@MonkeyBreaker
Collaborator Author

@MonkeyBreaker I'm sorry, I don't understand what comment this is.

Sorry, again I forgot to publish them...

@MonkeyBreaker
Collaborator Author

LGTM!
Let's go for the merge :)

I generated the documentation on my machine, it looks great 👍

@ulupo ulupo merged commit 467463b into giotto-ai:main Dec 7, 2021
v0.3.0 automation moved this from In progress to Done Dec 7, 2021
Successfully merging this pull request may close these issues.

Incorrect handling of zeros for dense input when a threshold is passed