
Mapper / Scikit-learn error from precomputed clustering #149

Closed
torlarse opened this issue Jan 7, 2020 · 24 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers
Milestone

Comments

@torlarse

torlarse commented Jan 7, 2020

Description

I have a point cloud of 1920 nine-dimensional points. When applying Mapper with DBSCAN clustering as in the Christmas Santa notebook, everything works fine. When I apply my own clustering algorithm with a precomputed distance matrix, I get an error. Using Kepler Mapper I can make this work by setting the parameter precomputed=True when calling mapper.map().

PS! I used the color function from the Santa .csv file as a hack to make the code run. It worked for the basic clustering method.

UPDATE: I added point_cloud.csv to the gist, I hope it works for reproduction.

Steps/Code to Reproduce

https://gist.github.com/torlarse/43604dd09a98cc3f69166659cd6ddf9e

Expected Results

A Mapper complex :)

Actual Results

Please see gist for traceback.

Versions

Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)]
NumPy 1.17.4
SciPy 1.3.3
joblib 0.14.0
Scikit-Learn 0.22
giotto-Learn 0.1.3

@ulupo ulupo self-assigned this Jan 7, 2020
@ulupo ulupo added the good first issue Good for newcomers label Jan 7, 2020
@ulupo ulupo added this to the 0.1.4 milestone Jan 7, 2020
@ulupo
Collaborator

ulupo commented Jan 7, 2020

Hey @torlarse, thanks for the detailed report, and for including the gist, which is extremely helpful. Thanks also for powering through despite the lack of complete documentation at this pre-release stage. If I understood your example correctly, the short answer is that you should not pass affinity='precomputed' -- in fact, you can get away with not passing affinity at all, since the scikit-learn default value is 'euclidean', which looks to be what you are after. The fit step is also unnecessary and un-Mapper-like: clustering must be done on each pullback cover set separately, so it does not help to fit the clusterer on the entire point cloud to begin with. It is also computationally wasteful, because distances between points in different pullback cover sets are never used.

Let me clarify the structure of the kind of pipeline created by make_mapper_pipeline, starting at line https://github.com/giotto-ai/giotto-learn/blob/183c3dbb58847337b9d7a5ac1c667f9e5418cd15/giotto/mapper/pipeline.py#L181

A version of the explanation which follows will end up clearly laid out in the docs for the stable release. Letting X denote the input data array, this is the structure of a Mapper pipeline in giotto-learn:

     X_scaled = scaler(X) -> Y = f(X_scaled) -> masks = cover(Y) -->
X ->                                                                 -> nodes = ParallelClustering(clusterer, X, masks) -> graph = Nerve(nodes)
     ------------------------- X (unchanged) ---------------------->

I hope the above is mostly self-explanatory. Essentially, X is passed as is to a ParallelClustering transformer which takes the unfitted clusterer as input, selects subportions of X according to the boolean array masks, and applies a clone of the clusterer to each subportion separately (possibly in parallel using joblib if requested by the user). So if the original X is a concrete point cloud in R^d (not distance matrix), each such subportion is still a point cloud in R^d, and typically will not be a square distance matrix.
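The per-cover-set clustering described above can be sketched in plain scikit-learn (a minimal conceptual illustration, not giotto-learn's actual ParallelClustering code; the array shapes and DBSCAN parameters are made up for the example):

```python
# Conceptual sketch of ParallelClustering: clone the unfitted clusterer
# and fit it on each masked subportion of X separately.
import numpy as np
from sklearn.base import clone
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))           # a point cloud in R^9
masks = np.zeros((100, 2), dtype=bool)  # one boolean column per pullback cover set
masks[:60, 0] = True                    # cover sets may overlap
masks[40:, 1] = True

clusterer = DBSCAN(eps=1.5)             # passed in unfitted
node_labels = []
for j in range(masks.shape[1]):
    sub = X[masks[:, j]]                # still a point cloud in R^9, not a distance matrix
    node_labels.append(clone(clusterer).fit(sub).labels_)

print([len(l) for l in node_labels])    # one label per point in each subportion: [60, 60]
```

Each iteration could also be dispatched to a joblib worker, which is what the optional parallelism amounts to.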

@torlarse
Author

torlarse commented Jan 7, 2020

@ulupo thanks so much for the superfast response! I hope to gain more insight into your code with time.

I need to use something Gunnar Carlsson calls the Variance Normalized Euclidean distance as a clustering metric. This means taking the Euclidean distance after dividing the columns of the dataset by the column variance. The only feasible way I found to do that was to precompute a distance matrix for the Mapper clustering. It seems to work with Scikit-TDA, but I fully understand that Giotto is in the early days of development :)
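For reference, the Variance Normalized Euclidean distance can be sketched in NumPy/SciPy (an editorial illustration, not the code from the gist): dividing each squared coordinate difference by the column variance is the same as scaling each column by its standard deviation before taking plain Euclidean distances, and coincides with SciPy's 'seuclidean' metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 9)) * np.arange(1, 10)  # columns with unequal variance

var = X.var(axis=0)
# VNE distance: Euclidean distance after scaling columns by their std
D_vne = squareform(pdist(X / np.sqrt(var)))

# Equivalent: SciPy's standardized Euclidean metric with the column variances
D_seu = squareform(pdist(X, metric="seuclidean", V=var))
print(np.allclose(D_vne, D_seu))  # True
```

This is also why StandardScaler(with_mean=False) works as a pre-clustering step: it performs exactly the column scaling on the left-hand side above.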

@ulupo
Collaborator

ulupo commented Jan 7, 2020

@torlarse, thanks for the feedback, which is extremely useful for our development. I think I now understand your question and needs! In fact, here's a possible solution: generalise the long bottom arrow in my diagram, which is currently restricted to be the identity function. This would only require a very small intervention in the definition of make_mapper_pipeline (which is already being refactored as part of #137) -- essentially, one would replace
https://github.com/giotto-ai/giotto-learn/blob/183c3dbb58847337b9d7a5ac1c667f9e5418cd15/giotto/mapper/pipeline.py#L187
with

            [('clustering_preprocessing', clustering_preprocessing), ('map_and_cover', map_and_cover)])),

where clustering_preprocessing would be a new parameter to pass to make_mapper_pipeline, set to be the identity function by default. In your case, you would set it to be scikit-learn's StandardScaler (passing with_mean=False to reproduce your code). I'll think about it more deeply with @lewtun in the upcoming days.

Here is a hack to make your code work for the time being: https://gist.github.com/ulupo/712b13823b826b3bd2a66096c85f4ed5

@ulupo ulupo added the enhancement New feature or request label Jan 7, 2020
@ulupo
Collaborator

ulupo commented Jan 8, 2020

@torlarse, does this allow you to obtain the desired results and workflow?

@torlarse
Author

torlarse commented Jan 8, 2020

@ulupo thanks so much for the assistance; it seems to be working, at least for a cover of 100 2-cubes. The edge case I am bumping into now is

ValueError: Found array with 1 sample(s) (shape=(1, 1920)) 
while a minimum of 2 is required by AgglomerativeClustering.

The above scikit-learn error is the result of partitioning the plane of two PCA component axes into 30^2 = 900 2-cubes. Necessarily there will be a lot of empty cubes and cubes with only 1 sample.

The error arises in scikit-learn's hierarchical.py, line 793:

X = check_array(X, ensure_min_samples=2, estimator=self)

and

if n_samples < ensure_min_samples:
    raise ValueError("Found array with %d sample(s) (shape=%s) while a"
                     " minimum of %d is required%s."

in validation.py, line 566.

UPDATE: so the 1-sampled array raising the error code is actually my entire point cloud, flattened to 1 point of 1920 dimensions.

If you don't have an immediate fix, I can try looking into the Mapper pipeline for clues myself :)

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Thanks again, @torlarse. To my mind, there are two ways to resolve this further point:

  • Treat this behaviour as a desirable feature of our implementation: the point of view here would be that each pullback cover set should be clustered exactly as specified by the clusterer. On this view, if scikit-learn's AgglomerativeClustering throws an error on singletons, we want an error to be thrown in our Mapper implementation too.
  • Shortcut the per-pullback-cover fitting step in our implementation of ParallelClustering: in practice, this could easily be done by case-handling in https://github.com/giotto-ai/giotto-learn/blob/2d72b2812f38a3bde451ee6eb0b4baec6a7cee0d/giotto/mapper/cluster.py#L137
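
The second option might look something like the following sketch (the helper name fit_cluster_subportion is hypothetical, not giotto-learn's actual code): singletons are assigned a trivial cluster label directly instead of being passed to the clusterer.

```python
# Sketch of case-handling singleton pullback cover sets before fitting,
# so AgglomerativeClustering never sees a 1-sample array.
import numpy as np
from sklearn.base import clone
from sklearn.cluster import AgglomerativeClustering

def fit_cluster_subportion(clusterer, X_sub):
    if X_sub.shape[0] == 1:
        return np.array([0])  # a singleton is trivially its own cluster
    return clone(clusterer).fit(X_sub).labels_

clusterer = AgglomerativeClustering(n_clusters=1)
print(fit_cluster_subportion(clusterer, np.zeros((1, 9))))  # [0]
print(fit_cluster_subportion(clusterer, np.zeros((3, 9))))  # [0 0 0]
```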

Meanwhile: have you considered passing kind='balanced' to the CubicalCover constructor? And have you considered using our implementations of agglomerative clustering with flat cuts, i.e. FirstSimpleGap and FirstHistogramGap?

@torlarse
Author

torlarse commented Jan 8, 2020

I'll look into everything you say 👍

@ulupo
Collaborator

ulupo commented Jan 8, 2020

I'm not sure I understand your comment:

UPDATE: so the 1-sampled array raising the error code is actually my entire point cloud, flattened to 1 point of 1920 dimensions.

My view is that this is indeed one point from your point cloud; why do you say it is the entire cloud, flattened?

@torlarse
Author

torlarse commented Jan 8, 2020

My original point cloud is 1920 nine-dimensional points. When debugging in the mentioned files I saw a 1-dimensional array with 1920 entries. Without checking thoroughly, I just assumed that it was my point cloud. Unless the array was a distance matrix, in which case my bad. I am in unfamiliar terrain now, sorry.

@ulupo
Collaborator

ulupo commented Jan 8, 2020

To reproduce, should I just re-run the same gist I sent you above?

@torlarse
Author

torlarse commented Jan 8, 2020

I believe so, if you can load the point_cloud.csv file from my gist?

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Ok, I went through it. I can't quite reproduce it: my ValueError message is

ValueError: Found array with 1 sample(s) (shape=(1, 9)) while a minimum of 2 is required by AgglomerativeClustering.

This makes sense for a point cloud in 9 dimensions. The shape in the error message you reported would suggest a serious bug in our code, however, so I'm not sure how to reproduce exactly what you had there.

UPDATE: Your shape could be explained by passing the distance matrix instead of the point cloud. But this isn't what was being done in the last gist I sent you.

@torlarse
Author

torlarse commented Jan 8, 2020

Sorry, your observation is spot on. I had been juggling a notebook and the Python file, and also testing the precomputed metric by passing the distance matrix. After passing the point cloud itself to the graph pipeline, I got the same error code as you. User error.

@ulupo
Collaborator

ulupo commented Jan 8, 2020

Ok, thanks for double checking!

@ulupo
Collaborator

ulupo commented Jan 10, 2020

Hi again @torlarse, just wanted to make sure a small thing was clear. When running my gist above (although what follows was also true in your earlier version), the PCA step is applied to the unnormalised data -- whereas, as we discussed, the clustering is done using the Variance Normalized Euclidean distance, as you required. Is this your desired behaviour? The alternative would have been to apply the PCA step to the variance-normalised data. In fact, variance (or min-max) normalisation of the data is often recommended (for good reason) before applying PCA.
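The alternative just described, normalising before PCA, can be sketched with plain scikit-learn (an editorial stand-in for the Mapper filter step; in make_mapper_pipeline the scaler and filter_func parameters would play these roles):

```python
# Variance-normalise the columns, then apply PCA as the filter function.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9)) * np.arange(1, 10)  # columns with unequal variance

# with_mean=False divides by column std without centring, matching the
# scaling used for the VNE clustering metric; PCA centres internally anyway.
filter_func = make_pipeline(StandardScaler(with_mean=False), PCA(n_components=2))
Y = filter_func.fit_transform(X)
print(Y.shape)  # (200, 2)
```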

@torlarse
Author

@ulupo, I sincerely appreciate your scrutiny of my code. I am trying to faithfully reproduce the work of Carlsson and Gabrielsson in [1]. In another file or module I mean-center and normalize the point cloud of 6400 nine-dimensional points before applying a specific density filtration. This is not shown in my gist, but should be reflected in point_cloud.csv if I have done things correctly.

@lewtun
Collaborator

lewtun commented Jan 12, 2020

@torlarse would you say that this issue is now resolved?

@torlarse
Author

@lewtun thanks for the reminder. Yes, the issue is resolved.

@ulupo
Collaborator

ulupo commented Jan 12, 2020

Thanks @torlarse! @lewtun, let's remember to add the most promising ideas for enhancements arising from this discussion to the development backlog. Then, I'll be happy to close the issue.

ulupo pushed a commit to ulupo/giotto-tda that referenced this issue Jan 14, 2020
Add optional transformer kwarg to make_mapper_pipeline to allow advanced users to transform data so that ListFeatureUnion returns the transformed data in the first entry. Feature follows request in giotto-ai#149
@torlarse
Author

torlarse commented Jan 25, 2020

@ulupo I am updating after testing release 0.1.4.

As a reminder, you kindly provided a workaround:

# Hack into the nested structure of pipeline to add a scaling step which only affects the clustering
from sklearn.preprocessing import StandardScaler
clustering_scaler = StandardScaler(with_mean=False)
pipeline_vne.set_params(pullback_cover__identity=clustering_scaler)

This procedure now triggers the following traceback. Sorry for the long output.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-1f5129bd3897> in <module>
      2 from sklearn.preprocessing import StandardScaler
      3 clustering_scaler = StandardScaler(with_mean=False)
----> 4 pipeline_vne.set_params(pullback_cover__identity=clustering_scaler)

c:\users\teril\gio-pip-2\lib\site-packages\gtda\mapper\pipeline.py in set_params(self, **kwargs)
    117                    for key in mapper_nerve_kwargs})
    118             [kwargs.pop(key) for key in mapper_nerve_kwargs]
--> 119         super().set_params(**kwargs)
    120         return self
    121 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\pipeline.py in set_params(self, **kwargs)
    161         self
    162         """
--> 163         self._set_params('steps', **kwargs)
    164         return self
    165 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\utils\metaestimators.py in _set_params(self, attr, **params)
     48                 self._replace_estimator(attr, name, params.pop(name))
     49         # 3. Step parameters and other initialisation arguments
---> 50         super().set_params(**params)
     51         return self
     52 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\base.py in set_params(self, **params)
    243 
    244         for key, sub_params in nested_params.items():
--> 245             valid_params[key].set_params(**sub_params)
    246 
    247         return self

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\pipeline.py in set_params(self, **kwargs)
    839         self
    840         """
--> 841         self._set_params('transformer_list', **kwargs)
    842         return self
    843 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\utils\metaestimators.py in _set_params(self, attr, **params)
     48                 self._replace_estimator(attr, name, params.pop(name))
     49         # 3. Step parameters and other initialisation arguments
---> 50         super().set_params(**params)
     51         return self
     52 

c:\users\teril\gio-pip-2\lib\site-packages\sklearn\base.py in set_params(self, **params)
    234                                  'Check the list of available parameters '
    235                                  'with `estimator.get_params().keys()`.' %
--> 236                                  (key, self))
    237 
    238             if delim:

ValueError: Invalid parameter identity for estimator ListFeatureUnion(n_jobs=None,
                 transformer_list=[('clustering_preprocessing',
                                    FunctionTransformer(accept_sparse=False,
                                                        check_inverse=True,
                                                        func=None,
                                                        inv_kw_args=None,
                                                        inverse_func=None,
                                                        kw_args=None,
                                                        validate=True)),
                                   ('map_and_cover',
                                    Pipeline(memory=None,
                                             steps=[('scaler',
                                                     FunctionTransformer(accept_sparse=False,
                                                                         check_inverse=True,
                                                                         func=None,
                                                                         inv_kw_args=None,
                                                                         inverse_func=None,
                                                                         kw_args=None,
                                                                         validate=False)),
                                                    ('filter_func',
                                                     PCA(copy=True,
                                                         iterated_power='auto',
                                                         n_components=2,
                                                         random_state=None,
                                                         svd_solver='auto',
                                                         tol=0.0,
                                                         whiten=False)),
                                                    ('cover',
                                                     CubicalCover(kind='uniform',
                                                                  n_intervals=30,
                                                                  overlap_frac=0.66))],
                                             verbose=True))],
                 transformer_weights=None, verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.

@ulupo
Collaborator

ulupo commented Jan 25, 2020

Hi @torlarse, thanks for getting back! The hack is now no longer necessary as we have implemented the clustering_preprocessing step I mentioned above. See https://docs-tda.giotto.ai/generated/gtda.mapper.pipeline.make_mapper_pipeline.html and https://docs-tda.giotto.ai/mapper_pipeline.svg. Let me know if this helps!

@torlarse
Author

@ulupo wow, you guys are doing an awesome job.

@ulupo ulupo closed this as completed Feb 10, 2020
@Delamater

Hi @torlarse, thanks for getting back! The hack is now no longer necessary as we have implemented the clustering_preprocessing step I mentioned above. See https://docs-tda.giotto.ai/generated/gtda.mapper.pipeline.make_mapper_pipeline.html and https://docs-tda.giotto.ai/mapper_pipeline.svg. Let me know if this helps!

These links return a 404 error for me.

@ulupo
Collaborator

ulupo commented Jun 5, 2021

4 participants