Mapper / Scikit-learn error from precomputed clustering #149
Comments
Hey @torlarse, thanks for the detailed report, and for including the gist which is extremely helpful. Thanks also for powering through despite the lack of complete documentation at this pre-release stage. If I understood your example correctly, the short answer is that you should not pass Let me clarify the structure of the kind of pipeline created by A version of the explanation which follows will end up clearly laid out in the docs for the stable release. Let
I hope the above is mostly self-explanatory. Essentially, |
@ulupo thanks so much for the superfast response! I hope to gain more insight into your code with time. I need to use something Gunnar Carlsson calls Variance Normalized Euclidean distance as a clustering metric. This means taking the Euclidean distance after dividing the columns in the dataset by the column variance. The only feasible way I found to do that was to precompute a distance matrix for the Mapper clustering. It seems to work with Scikit-TDA, but I fully understand that Giotto is in the early days of development :) |
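For readers following along, the distance described above can be sketched in a few lines of NumPy. This is only an illustration of the description in this comment (divide each column by its variance, then take Euclidean distances); the function name `variance_normalized_distances` and the random stand-in data are assumptions, not code from either gist.

```python
import numpy as np


def variance_normalized_distances(points):
    """Pairwise Euclidean distances after dividing each column by its variance.

    Hypothetical helper written for illustration; not part of any library.
    """
    points = np.asarray(points, dtype=float)
    column_variance = points.var(axis=0)
    if np.any(column_variance == 0):
        raise ValueError("a constant column has zero variance and cannot be rescaled")
    scaled = points / column_variance
    # Broadcast to all pairwise differences: shape (n_points, n_points, n_features).
    differences = scaled[:, np.newaxis, :] - scaled[np.newaxis, :, :]
    return np.sqrt((differences ** 2).sum(axis=-1))


# Stand-in data: 50 random nine-dimensional points (not the point cloud from the gist).
rng = np.random.default_rng(0)
point_cloud = rng.normal(size=(50, 9))
distance_matrix = variance_normalized_distances(point_cloud)
```

A square matrix like this is the kind of input that scikit-learn clusterers such as DBSCAN accept when constructed with metric="precomputed". Note that SciPy's "seuclidean" metric is a different normalisation: it divides the squared differences by the variance, which amounts to dividing the columns by their standard deviation.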
@torlarse, thanks for the feedback which is extremely useful for our development. I now understand your question and needs I think! In fact, here's a possible solution: generalise the long bottom arrow in my diagram, which is currently restricted to be the identity function. Actually this would only require a very small intervention in the definition of
where Here is a hack to make your code work for the time being: https://gist.github.com/ulupo/712b13823b826b3bd2a66096c85f4ed5 |
@torlarse, does this allow you to obtain the desired results and workflow? |
@ulupo thanks so much for the assistance, it seems to be working at least for a cover of 100 2-cubes. The edge case I am bumping into now is
The above The error arises in the source code of Scikit-learn
and
in UPDATE: so the 1-sample array raising the error is actually my entire point cloud, flattened to 1 point of 1920 dimensions. If you don't have an immediate fix I can try looking into the Mapper pipeline for clues myself :) |
Thanks again, @torlarse. There are two ways to resolve this further point in my mind:
Meanwhile: have you considered passing |
I'll look into everything you say 👍 |
I'm not sure I understand your comment:
My view is that this is indeed one point from your point cloud, why do you say it is the entire one, flattened? |
My original point cloud is 1920 nine-dimensional points. When debugging in the mentioned files I saw a 1-dimensional array with 1920 points. Without thorough checking I just assumed that it was my point cloud. Unless the array was a distance matrix, in which case my bad. I am in unfamiliar terrain now, sorry |
To reproduce, should I just re-run the same gist I sent you above? |
I believe so, if you can load the |
Ok, I went through it. I can't quite reproduce. I.e., my
This makes sense for a point cloud in 9 dimensions. The shape in the error message you reported would suggest a serious bug in our code, however. So I'm not sure how to reproduce exactly what you had there. UPDATE: Your shape could be explained by passing the distance matrix instead of the point cloud. But this isn't what was being done in the last gist I sent you. |
Sorry, your observation is spot on. I had been juggling a notebook and the Python file, while also testing the precomputed metric by passing the distance matrix. After I passed the point cloud itself to the graph pipeline I got the same error code as you. User error. |
Ok, thanks for double checking! |
Hi again @torlarse, just wanted to make sure a small thing was clear. When running my gist above (although what follows was also true in your earlier version), the PCA step is applied to the unnormalised data -- whereas, as we discussed already, the clustering is done using the Variance Normalized Euclidean distance as you required. Is this your desired behaviour? The alternative would have been to apply the PCA step to the variance normalised data. In fact, variance (or min-max) normalisation of the data is often recommended (for good reason) before applying PCA. |
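The difference between the two options can be seen on toy data. The sketch below is an assumption-laden illustration, not the code from either gist: it uses random stand-in data with one deliberately large-scale column, and it normalises by dividing each column by its variance, as described earlier in this thread.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 nine-dimensional points where the last column has a much larger scale.
rng = np.random.default_rng(0)
scales = np.array([1, 1, 1, 1, 1, 1, 1, 1, 10], dtype=float)
point_cloud = rng.normal(size=(200, 9)) * scales

# Option 1: PCA on the unnormalised data (what the gist does).
pca_raw = PCA(n_components=2)
lens_raw = pca_raw.fit_transform(point_cloud)

# Option 2: PCA on the data after dividing each column by its variance.
normalized = point_cloud / point_cloud.var(axis=0)
pca_normalized = PCA(n_components=2)
lens_normalized = pca_normalized.fit_transform(normalized)

# Weight of the large-scale column in the first principal direction of each fit.
weight_raw = abs(pca_raw.components_[0, -1])
weight_normalized = abs(pca_normalized.components_[0, -1])
```

On data like this, the first principal direction of the raw fit is dominated by the large-scale column, while after the rescaling that column contributes very little. Dividing by the variance inverts the column scales rather than equalising them; dividing by the standard deviation would give every column unit variance instead.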
@ulupo , I sincerely appreciate your scrutiny of my code. I am trying to faithfully reproduce the work of Carlsson and Gabrielsson in 1. In another file or module I mean-center and normalize the point cloud of 6400 nine-dimensional points before applying a specific density filtration. This is not shown in my gist, but should be implemented in the |
@torlarse would you say that this issue is now resolved? |
@lewtun thanks for the reminder. Yes the issue is resolved. |
Add optional transformer kwarg to make_mapper_pipeline to allow advanced users to transform data so ListFeatureUnion returns the transformed data in the first entry. Feature follows request in (giotto-ai#149)
@ulupo I am updating after testing release As a reminder you kindly provided a work-around
This procedure now triggers the following traceback. Sorry for the long output.
|
Hi @torlarse, thanks for getting back! The hack is now no longer necessary as we have implemented the |
@ulupo wow, you guys are doing an awesome job. |
These links return a 404 error for me. |
@Delamater thanks for reporting this. Please use these up-to-date links: https://giotto-ai.github.io/gtda-docs/0.4.0/modules/generated/mapper/pipeline/gtda.mapper.make_mapper_pipeline.html and https://giotto-ai.github.io/gtda-docs/0.4.0/_images/mapper_pipeline.svg |
Description
I have a point cloud of 1920 nine-dimensional points. When applying Mapper with DBSCAN clustering as in the Christmas Santa notebook, everything works fine. When I apply my own clustering algorithm with a precomputed distance matrix, I get an error. Using Kepler Mapper I make this work by setting the parameter precomputed=True when calling mapper.map().
PS! I used the color function from the Santa .csv file as a hack to make the code run. It worked for the basic clustering method.
UPDATE: I added point_cloud.csv to the gist, I hope it works for reproduction.
Steps/Code to Reproduce
https://gist.github.com/torlarse/43604dd09a98cc3f69166659cd6ddf9e
Expected Results
A Mapper complex :)
Actual Results
Please see gist for traceback.
Versions
Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)]
NumPy 1.17.4
SciPy 1.3.3
joblib 0.14.0
Scikit-Learn 0.22
giotto-Learn 0.1.3