
Mapper / Scikit-learn error from precomputed clustering #149

Closed
torlarse opened this issue Jan 7, 2020 · 24 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers
Milestone

Comments

@torlarse

torlarse commented Jan 7, 2020

Description

I have a point cloud of 1920 nine-dimensional points. When applying Mapper with DBSCAN clustering as in the Christmas Santa notebook, everything works fine. When I apply my own clustering algorithm with a precomputed distance matrix, I get an error. Using Kepler Mapper I can make this work by setting the parameter precomputed=True when calling mapper.map().

PS! I used the color function from the Santa .csv file as a hack to make the code run. It worked for the basic clustering method.

UPDATE: I added point_cloud.csv to the gist, I hope it works for reproduction.

Steps/Code to Reproduce

https://gist.github.com/torlarse/43604dd09a98cc3f69166659cd6ddf9e

Expected Results

A Mapper complex :)

Actual Results

Please see gist for traceback.

Versions

Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)]
NumPy 1.17.4
SciPy 1.3.3
joblib 0.14.0
Scikit-Learn 0.22
giotto-Learn 0.1.3

@ulupo ulupo self-assigned this Jan 7, 2020
@ulupo ulupo added the good first issue Good for newcomers label Jan 7, 2020
@ulupo ulupo added this to the 0.1.4 milestone Jan 7, 2020
@ulupo
Collaborator

ulupo commented Jan 7, 2020

Hey @torlarse, thanks for the detailed report, and for including the gist, which is extremely helpful. Thanks also for powering through despite the lack of complete documentation at this pre-release stage. If I understood your example correctly, the short answer is that you should not pass affinity='precomputed' -- in fact, you can get away with not passing affinity at all, since the scikit-learn default value is 'euclidean', which looks to be what you are after. The fit step is also unnecessary and un-Mapper-like: clustering must be done on each pullback cover set separately, so it does not help to fit the clusterer on the entire point cloud to begin with. It is also computationally wasteful, because distances between points in different pullback cover sets are never used.

Let me clarify the structure of the kind of pipeline created by make_mapper_pipeline, starting at line https://github.com/giotto-ai/giotto-learn/blob/183c3dbb58847337b9d7a5ac1c667f9e5418cd15/giotto/mapper/pipeline.py#L181

A version of the explanation which follows will end up clearly laid out in the docs for the stable release. Letting X denote the input data array, this is the structure of a Mapper pipeline in giotto-learn:

     X_scaled = scaler(X) -> Y = f(X_scaled) -> masks = cover(Y) -->
X ->                                                                 -> nodes = ParallelClustering(clusterer, X, masks) -> graph = Nerve(nodes)
     ------------------------- X (unchanged) ---------------------->

I hope the above is mostly self-explanatory. Essentially, X is passed as is to a ParallelClustering transformer which takes the unfitted clusterer as input, selects subportions of X according to the boolean array masks, and applies a clone of the clusterer to each subportion separately (possibly in parallel using joblib if requested by the user). So if the original X is a concrete point cloud in R^d (not distance matrix), each such subportion is still a point cloud in R^d, and typically will not be a square distance matrix.
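The per-cover-set clustering described above can be sketched in plain scikit-learn (a minimal conceptual illustration, not giotto-learn's actual ParallelClustering code; the array shapes and DBSCAN parameters are made up for the example):

```python
# Conceptual sketch of ParallelClustering: clone the unfitted clusterer
# and fit it on each masked subportion of X separately.
import numpy as np
from sklearn.base import clone
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))           # a point cloud in R^9
masks = np.zeros((100, 2), dtype=bool)  # one boolean column per pullback cover set
masks[:60, 0] = True                    # cover sets may overlap
masks[40:, 1] = True

clusterer = DBSCAN(eps=1.5)             # passed in unfitted
node_labels = []
for j in range(masks.shape[1]):
    sub = X[masks[:, j]]                # still a point cloud in R^9, not a distance matrix
    node_labels.append(clone(clusterer).fit(sub).labels_)

print([len(l) for l in node_labels])    # one label per point in each subportion: [60, 60]
```

Each iteration could also be dispatched to a joblib worker, which is what the optional parallelism amounts to.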

@torlarse
Author

torlarse commented Jan 7, 2020

@ulupo thanks so much for the superfast response! I hope to gain more insight into your code with time.

I need to use something Gunnar Carlsson calls the Variance Normalized Euclidean distance as a clustering metric. This means taking the Euclidean distance after dividing the columns of the dataset by the column variance. The only feasible way I found to do that was to precompute a distance matrix for the Mapper clustering. It seems to work with Scikit-TDA, but I fully understand that Giotto is in the early days of development :)
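For reference, the Variance Normalized Euclidean distance can be sketched in NumPy/SciPy (an editorial illustration, not the code from the gist): dividing each squared coordinate difference by the column variance is the same as scaling each column by its standard deviation before taking plain Euclidean distances, and coincides with SciPy's 'seuclidean' metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 9)) * np.arange(1, 10)  # columns with unequal variance

var = X.var(axis=0)
# VNE distance: Euclidean distance after scaling columns by their std
D_vne = squareform(pdist(X / np.sqrt(var)))

# Equivalent: SciPy's standardized Euclidean metric with the column variances
D_seu = squareform(pdist(X, metric="seuclidean", V=var))
print(np.allclose(D_vne, D_seu))  # True
```

This is also why StandardScaler(with_mean=False) works as a pre-clustering step: it performs exactly the column scaling on the left-hand side above.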

@ulupo
Collaborator

ulupo commented Jan 7, 2020

@torlarse, thanks for the feedback, which is extremely useful for our development. I think I now understand your question and needs! In fact, here's a possible solution: generalise the long bottom arrow in my diagram, which is currently restricted to be the identity function. This would only require a very small intervention in the definition of make_mapper_pipeline (which is already being refactored as part of #137) -- essentially, one would replace
https://github.com/giotto-ai/giotto-learn/blob/183c3dbb58847337b9d7a5ac1c667f9e5418cd15/giotto/mapper/pipeline.py#L187
with

            [('clustering_preprocessing', clustering_preprocessing), ('map_and_cover', map_and_cover)])),

where clustering_preprocessing would be a new parameter to pass to make_mapper_pipeline, set to be the identity function by default. In your case, you would set it to be scikit-learn's StandardScaler (passing with_mean=False to reproduce your code). I'll think about it more deeply with @lewtun in the upcoming days.

Here is a hack to make your code work for the time being: https://gist.github.com/ulupo/712b13823b826b3bd2a66096c85f4ed5

@ulupo ulupo added the enhancement New feature or request label Jan 7, 2020
@ulupo
Collaborator

ulupo commented Jan 8, 2020

@torlarse, does this allow you to obtain the desired results and workflow?

@torlarse
Author

torlarse commented Jan 8, 2020

@ulupo thanks so much for the assistance; it seems to be working, at least for a cover of 100 2-cubes. The edge case I am bumping into now is

ValueError: Found array with 1 sample(s) (shape=(1, 1920)) 
while a minimum of 2 is required by AgglomerativeClustering.

The above scikit-learn error is the result of partitioning the plane of two PCA component axes into 30^2 = 900 2-cubes. Necessarily there will be a lot of empty cubes and cubes with only 1 sample.

The error arises in scikit-learn's hierarchical.py, line 793:

X = check_array(X, ensure_min_samples=2, estimator=self)

and

if n_samples < ensure_min_samples:
    raise ValueError("Found array with %d sample(s) (shape=%s) while a"
                     " minimum of %d is required%s."

in validation.py, line 566.

UPDATE: so the 1-sampled array raising the error code is actually my entire point cloud, flattened to 1 point of 1920 dimensions.

If you don't have an immediate fix, I can try looking into the Mapper pipeline for clues myself :)

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Thanks again, @torlarse. To my mind, there are two ways to resolve this further point:

  • Treat this behaviour as a desirable feature of our implementation: the point of view here would be that each pullback cover set should be clustered exactly as specified by the clusterer. On this view, if scikit-learn's AgglomerativeClustering throws an error on singletons, we want an error to be thrown in our Mapper implementation too.
  • Shortcut the per-pullback-cover fitting step in our implementation of ParallelClustering: in practice, this could easily be done by case-handling in https://github.com/giotto-ai/giotto-learn/blob/2d72b2812f38a3bde451ee6eb0b4baec6a7cee0d/giotto/mapper/cluster.py#L137
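
The second option might look something like the following sketch (the helper name fit_cluster_subportion is hypothetical, not giotto-learn's actual code): singletons are assigned a trivial cluster label directly instead of being passed to the clusterer.

```python
# Sketch of case-handling singleton pullback cover sets before fitting,
# so AgglomerativeClustering never sees a 1-sample array.
import numpy as np
from sklearn.base import clone
from sklearn.cluster import AgglomerativeClustering

def fit_cluster_subportion(clusterer, X_sub):
    if X_sub.shape[0] == 1:
        return np.array([0])  # a singleton is trivially its own cluster
    return clone(clusterer).fit(X_sub).labels_

clusterer = AgglomerativeClustering(n_clusters=1)
print(fit_cluster_subportion(clusterer, np.zeros((1, 9))))  # [0]
print(fit_cluster_subportion(clusterer, np.zeros((3, 9))))  # [0 0 0]
```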

Meanwhile: have you considered passing kind='balanced' to the CubicalCover constructor? And have you considered using our implementations of agglomerative clustering with flat cuts, i.e. FirstSimpleGap and FirstHistogramGap?

@torlarse
Author

torlarse commented Jan 8, 2020

I'll look into everything you say 👍

@ulupo
Collaborator

ulupo commented Jan 8, 2020

I'm not sure I understand your comment:

UPDATE: so the 1-sampled array raising the error code is actually my entire point cloud, flattened to 1 point of 1920 dimensions.

My view is that this is indeed one point from your point cloud; why do you say it is the entire cloud, flattened?

@torlarse
Author

torlarse commented Jan 8, 2020

My original point cloud is 1920 nine-dimensional points. When debugging in the mentioned files I saw a 1-dimensional array with 1920 entries. Without checking thoroughly, I just assumed that it was my point cloud. Unless the array was a distance matrix, in which case my bad. I am in unfamiliar terrain now, sorry.

@ulupo
Collaborator

ulupo commented Jan 8, 2020

To reproduce, should I just re-run the same gist I sent you above?

@torlarse
Author

torlarse commented Jan 8, 2020

I believe so, if you can load the point_cloud.csv file from my gist?

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Ok, I went through it. I can't quite reproduce it: my ValueError message is

ValueError: Found array with 1 sample(s) (shape=(1, 9)) while a minimum of 2 is required by AgglomerativeClustering.

This makes sense for a point cloud in 9 dimensions. The shape in the error message you reported would suggest a serious bug in our code, however, so I'm not sure how to reproduce exactly what you had there.

UPDATE: Your shape could be explained by passing the distance matrix instead of the point cloud. But this isn't what was being done in the last gist I sent you.

@torlarse
Author

torlarse commented Jan 8, 2020

Sorry, your observation is spot on. I had been juggling a notebook and the Python file, and also testing the precomputed metric by passing the distance matrix. After passing the point cloud itself to the graph pipeline, I got the same error code as you. User error.

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Ok, thanks for double checking!

@ulupo
Collaborator

ulupo commented Jan 10, 2020

Hi again @torlarse, just wanted to make sure a small thing was clear. When running my gist above (although what follows was also true in your earlier version), the PCA step is applied to the unnormalised data -- whereas, as we discussed, the clustering is done using the Variance Normalized Euclidean distance, as you required. Is this your desired behaviour? The alternative would have been to apply the PCA step to the variance-normalised data. In fact, variance (or min-max) normalisation of the data is often recommended (for good reason) before applying PCA.
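The alternative just described, normalising before PCA, can be sketched with plain scikit-learn (an editorial stand-in for the Mapper filter step; in make_mapper_pipeline the scaler and filter_func parameters would play these roles):

```python
# Variance-normalise the columns, then apply PCA as the filter function.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9)) * np.arange(1, 10)  # columns with unequal variance

# with_mean=False divides by column std without centring, matching the
# scaling used for the VNE clustering metric; PCA centres internally anyway.
filter_func = make_pipeline(StandardScaler(with_mean=False), PCA(n_components=2))
Y = filter_func.fit_transform(X)
print(Y.shape)  # (200, 2)
```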

@torlarse
Author

@ulupo, I sincerely appreciate your scrutiny of my code. I am trying to faithfully reproduce the work of Carlsson and Gabrielsson in [1]. In another file or module I mean-center and normalize the point cloud of 6400 nine-dimensional points before applying a specific density filtration. This is not shown in my gist, but should be reflected in point_cloud.csv if I have done things correctly.

@lewtun
Collaborator

lewtun commented Jan 12, 2020

@torlarse would you say that this issue is now resolved?

@torlarse
Author

@lewtun thanks for the reminder. Yes, the issue is resolved.

@ulupo
Collaborator

ulupo commented Jan 12, 2020

Thanks @torlarse! @lewtun, let's remember to add the most promising ideas for enhancements arising from this discussion to the development backlog. Then, I'll be happy to close the issue.

ulupo pushed a commit to ulupo/giotto-tda that referenced this issue Jan 14, 2020
Add optional transformer kwarg to make_mapper_pipeline to allow advanced users to transform data so that ListFeatureUnion returns the transformed data in the first entry. Feature follows request in giotto-ai#149
@torlarse
Author

torlarse commented Jan 25, 2020

@ulupo I am updating after testing release 0.1.4.

As a reminder, you kindly provided a workaround:

# Hack into the nested structure of pipeline to add a scaling step which only affects the clustering
from sklearn.preprocessing import StandardScaler
clustering_scaler = StandardScaler(with_mean=False)
pipeline_vne.set_params(pullback_cover__identity=clustering_scaler)

This procedure now triggers the following traceback. Sorry for the long output.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-1f5129bd3897> in <module>
      2 from sklearn.preprocessing import StandardScaler
      3 clustering_scaler = StandardScaler(with_mean=False)
----> 4 pipeline_vne.set_params(pullback_cover__identity=clustering_scaler)

c:\users\teril\gio-pip-2\lib\site-packages\gtda\mapper\pipeline.py in set_params(self, **kwargs)
    117                    for key in mapper_nerve_kwargs})
    118             [kwargs.pop(key) for key in mapper_nerve_kwargs]
--> 119         super().set_params(**kwargs)
    120         return self
    121 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\pipeline.py in set_params(self, **kwargs)
    161         self
    162         """
--> 163         self._set_params('steps', **kwargs)
    164         return self
    165 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\utils\metaestimators.py in _set_params(self, attr, **params)
     48                 self._replace_estimator(attr, name, params.pop(name))
     49         # 3. Step parameters and other initialisation arguments
---> 50         super().set_params(**params)
     51         return self
     52 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\base.py in set_params(self, **params)
    243 
    244         for key, sub_params in nested_params.items():
--> 245             valid_params[key].set_params(**sub_params)
    246 
    247         return self

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\pipeline.py in set_params(self, **kwargs)
    839         self
    840         """
--> 841         self._set_params('transformer_list', **kwargs)
    842         return self
    843 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\utils\metaestimators.py in _set_params(self, attr, **params)
     48                 self._replace_estimator(attr, name, params.pop(name))
     49         # 3. Step parameters and other initialisation arguments
---> 50         super().set_params(**params)
     51         return self
     52 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\base.py in set_params(self, **params)
    234                                  'Check the list of available parameters '
    235                                  'with `estimator.get_params().keys()`.' %
--> 236                                  (key, self))
    237 
    238             if delim:

ValueError: Invalid parameter identity for estimator ListFeatureUnion(n_jobs=None,
                 transformer_list=[('clustering_preprocessing',
                                    FunctionTransformer(accept_sparse=False,
                                                        check_inverse=True,
                                                        func=None,
                                                        inv_kw_args=None,
                                                        inverse_func=None,
                                                        kw_args=None,
                                                        validate=True)),
                                   ('map_and_cover',
                                    Pipeline(memory=None,
                                             steps=[('scaler',
                                                     FunctionTransformer(accept_sparse=False,
                                                                         check_inverse=True,
                                                                         func=None,
                                                                         inv_kw_args=None,
                                                                         inverse_func=None,
                                                                         kw_args=None,
                                                                         validate=False)),
                                                    ('filter_func',
                                                     PCA(copy=True,
                                                         iterated_power='auto',
                                                         n_components=2,
                                                         random_state=None,
                                                         svd_solver='auto',
                                                         tol=0.0,
                                                         whiten=False)),
                                                    ('cover',
                                                     CubicalCover(kind='uniform',
                                                                  n_intervals=30,
                                                                  overlap_frac=0.66))],
                                             verbose=True))],
                 transformer_weights=None, verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.

@ulupo
Collaborator

ulupo commented Jan 25, 2020

Hi @torlarse, thanks for getting back! The hack is now no longer necessary as we have implemented the clustering_preprocessing step I mentioned above. See https://docs-tda.giotto.ai/generated/gtda.mapper.pipeline.make_mapper_pipeline.html and https://docs-tda.giotto.ai/mapper_pipeline.svg. Let me know if this helps!

@torlarse
Author

@ulupo wow, you guys are doing an awesome job.

@ulupo ulupo closed this as completed Feb 10, 2020
@Delamater

Hi @torlarse, thanks for getting back! The hack is now no longer necessary as we have implemented the clustering_preprocessing step I mentioned above. See https://docs-tda.giotto.ai/generated/gtda.mapper.pipeline.make_mapper_pipeline.html and https://docs-tda.giotto.ai/mapper_pipeline.svg. Let me know if this helps!

These links return a 404 error for me.

@ulupo
Collaborator

ulupo commented Jun 5, 2021

4 participants