Integration into linfa #3

Closed
bytesnake opened this issue Mar 16, 2021 · 9 comments · Fixed by #5
@bytesnake

I just saw your post on Reddit, awesome work! I'm the maintainer of linfa and have thought about implementing t-SNE as a transformative dimensionality reduction technique in the past, but never got around to it. This crate can take a lot of work off our hands. We would implement a wrapper which adapts your algorithm by:

  • implementing a builder-style pattern for configuration (sketched below)
  • using datasets for input/output
  • implementing the transform trait for the algorithm
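
A rough sketch of what such a wrapper could look like (the names `TSneParams`, `embedding_size`, the builder defaults, and the stubbed `transform` body are illustrative assumptions, not a final API):

    // Hypothetical sketch of a builder-style wrapper around bhtsne; names and
    // defaults are assumptions for illustration, not linfa's final API.
    pub struct TSneParams {
        embedding_size: usize,
        perplexity: f64,
        theta: f64,
    }

    impl TSneParams {
        /// Start from defaults and override via the builder methods.
        pub fn embedding_size(embedding_size: usize) -> Self {
            Self {
                embedding_size,
                perplexity: 30.0,
                theta: 0.5,
            }
        }

        pub fn perplexity(mut self, perplexity: f64) -> Self {
            self.perplexity = perplexity;
            self
        }

        pub fn theta(mut self, theta: f64) -> Self {
            self.theta = theta;
            self
        }

        /// A linfa `Transformer` impl would take a dataset, call `bhtsne::run`
        /// with these parameters, and hand back the embedded records.
        pub fn transform(&self, _records: &mut [f64], n_samples: usize, _n_features: usize) -> Vec<f64> {
            // Placeholder: the real implementation delegates to bhtsne::run here.
            vec![0.0; n_samples * self.embedding_size]
        }
    }

Usage would then read something like `TSneParams::embedding_size(2).perplexity(15.0).theta(0.5).transform(&mut data, n_samples, n_features)`.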

Sounds good? I just quickly glanced at the source code and three things stood out which could be improved:

  • make the csv dependency optional; sometimes it's not necessary to pull it in
  • make the algorithm generic over num_traits::Float
  • what about error handling? Can any part of the algorithm fail? In particular, what happens if there are NaNs in the data or if parameters are misconfigured (e.g. a negative perplexity)?
@frjnn
Owner

frjnn commented Mar 16, 2021

awesome work!

Thanks!

We would implement a wrapper which adapts your algorithm by (...)

Sounds good to me.

make the csv dependency optional; sometimes it's not necessary to pull it in

Ok, easily done.
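
For instance, a minimal sketch of gating the CSV helpers behind an optional Cargo feature (the function name and signature here are just illustrative):

    // Sketch only: the CSV helpers compile only when the optional `csv` feature
    // is enabled, so dependents that don't need file I/O never pull in the crate.
    #[cfg(feature = "csv")]
    pub fn load_csv(path: &str) -> Result<Vec<f64>, Box<dyn std::error::Error>> {
        let mut reader = csv::Reader::from_path(path)?;
        let mut data = Vec::new();
        for record in reader.records() {
            for field in record?.iter() {
                data.push(field.parse::<f64>()?);
            }
        }
        Ok(data)
    }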

make the algorithm generic over num_traits::Float.

Sure.

what about error handling? (...)

Currently not much, to be honest. There's just a check on the perplexity value, and NaNs are free to spread.
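
One possible direction (not the crate's current behaviour; names like `TsneError` are hypothetical) would be to validate the parameters and the input up front and return a `Result`:

    // Hypothetical up-front validation, for illustration only.
    #[derive(Debug)]
    pub enum TsneError {
        NonPositivePerplexity(f64),
        NanInInput { index: usize },
    }

    fn validate(data: &[f64], perplexity: f64) -> Result<(), TsneError> {
        if perplexity <= 0.0 {
            return Err(TsneError::NonPositivePerplexity(perplexity));
        }
        // Reject inputs containing NaNs instead of letting them propagate.
        if let Some(index) = data.iter().position(|x| x.is_nan()) {
            return Err(TsneError::NanInInput { index });
        }
        Ok(())
    }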

frjnn added the enhancement label Mar 16, 2021
frjnn self-assigned this Mar 16, 2021
@bytesnake
Author

Ok, I will ping back here once I have a minimal PR

@bytesnake
Author

bytesnake commented Mar 18, 2021

I opened a PR here rust-ml/linfa#101. I'm running into two problems with the Iris flower dataset:

  • everything is NaN for theta=0
  • stack overflow when perplexity=1 (is such a low perplexity reasonable?)

@frjnn
Owner

frjnn commented Mar 18, 2021

Could you please specify the full configuration of parameters that you used?

@bytesnake
Author

for the stack overflow:

        bhtsne::run(
            &mut data,
            nsamples,
            nfeatures,
            &mut y,
            2,
            1.0, // perplexity
            0.5, // theta
            false,
            2000,
            250,
            250,
        );

for the NaN output:

        bhtsne::run(
            &mut data,
            nsamples,
            nfeatures,
            &mut y,
            2,
            15.0, // perplexity
            0.0,  // theta
            false,
            2000,
            250,
            250,
        );

@frjnn
Owner

frjnn commented Mar 19, 2021

The former error (the stack overflow) was caused by an overflow happening during the computation of the optimal entropy for the P distribution: this is done, roughly, by a binary search over the real numbers, and for very small perplexity values it can take many iterations. Although unusual, since the paper recommends values between 5 and 50, a perplexity of 1.0 is valid and the algorithm should be able to handle it. The latter (the NaN output) was caused by the same issue combined with a bug in the squared Euclidean distance matrix function.
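
For context, this entropy fit is the standard t-SNE perplexity search: a binary search over the Gaussian precision (beta) until the entropy of the conditional distribution matches ln(perplexity). A simplified sketch of the idea, not the crate's actual code:

    // Simplified sketch of the perplexity binary search, for illustration only.
    // `distances` holds squared distances from one point to all the others.
    fn fit_beta(distances: &[f64], perplexity: f64, max_iter: usize) -> f64 {
        let target_entropy = perplexity.ln();
        let (mut beta, mut beta_min, mut beta_max) = (1.0, f64::NEG_INFINITY, f64::INFINITY);

        for _ in 0..max_iter {
            // Conditional probabilities p_{j|i} and their Shannon entropy (in nats).
            let weights: Vec<f64> = distances.iter().map(|&d| (-beta * d).exp()).collect();
            let sum: f64 = weights.iter().sum();
            let entropy =
                beta * distances.iter().zip(&weights).map(|(d, w)| d * w).sum::<f64>() / sum + sum.ln();

            if (entropy - target_entropy).abs() < 1e-5 {
                break;
            }
            if entropy > target_entropy {
                // Distribution too spread out: increase the precision.
                beta_min = beta;
                beta = if beta_max.is_finite() { (beta + beta_max) / 2.0 } else { beta * 2.0 };
            } else {
                // Distribution too peaked: decrease the precision.
                beta_max = beta;
                beta = if beta_min.is_finite() { (beta + beta_min) / 2.0 } else { beta / 2.0 };
            }
        }
        beta
    }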

They should both be fixed now. I'm currently working on switching to num_traits::Float.

@bytesnake
Author

Sounds good. I'm still a bit confused that you got a stack overflow even though you are not using recursion anywhere in your algorithm (at least, that's where I ran into that last time). You still have to push your changes 😄
I would also recommend moving the rng init here out of the loop, because it can be really slow:

.sample(&mut rand::thread_rng())
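
In other words, roughly (a generic sketch, not the crate's code; the Normal distribution and its parameters are placeholders):

    use rand::distributions::Distribution;
    use rand_distr::Normal;

    // Initialise the RNG and the distribution once, outside the loop, instead of
    // calling rand::thread_rng() for every single sample.
    fn init_embedding(samples: &mut [f64]) {
        let mut rng = rand::thread_rng();
        let normal = Normal::new(0.0, 1e-4).unwrap();
        for value in samples.iter_mut() {
            *value = normal.sample(&mut rng);
        }
    }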

@frjnn
Owner

frjnn commented Mar 19, 2021

I would also recommend moving the rng init here out of the loop, because it can be really slow.

Done, thanks for the tip.

you still have to push your changes

Also done.

(...) you are not using recursion anywhere in your algorithm.

        bhtsne::run(
            &mut data,
            nsamples,
            nfeatures,
            &mut y,
            2,
            1.0, // perplexity
            0.5, // theta
            false,
            2000,
            250,
            250,
        );

theta not being set to 0.0 means that the algorithm uses the Barnes-Hut acceleration, which indeed makes some recursive calls during the gradient computation.
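
For anyone following along, a simplified sketch of that recursion (illustrative only, not the crate's actual tree code):

    // A tree cell that is far enough away (size / distance < theta) is summarised
    // by its centre of mass; otherwise we recurse into its children. With
    // theta = 0.0 no cell is ever summarised, which amounts to the exact O(n^2)
    // computation.
    struct Cell {
        center_of_mass: [f64; 2],
        n_points: usize,
        size: f64,
        children: Vec<Cell>,
    }

    fn accumulate_repulsion(cell: &Cell, point: [f64; 2], theta: f64, force: &mut [f64; 2]) {
        if cell.n_points == 0 {
            return;
        }
        let dx = point[0] - cell.center_of_mass[0];
        let dy = point[1] - cell.center_of_mass[1];
        let dist = (dx * dx + dy * dy).sqrt().max(1e-12);

        if cell.children.is_empty() || cell.size / dist < theta {
            // Far enough away (or a leaf): treat the whole cell as one point.
            let w = cell.n_points as f64 / (1.0 + dist * dist);
            force[0] += w * dx;
            force[1] += w * dy;
        } else {
            // Too close: descend recursively into the children.
            for child in &cell.children {
                accumulate_repulsion(child, point, theta, force);
            }
        }
    }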

If you find any other quirks please let me know.

Also I'd like to ask the following question: why is the num_traits::Float trait necessary? Why would you need f64s?

@bytesnake
Author

theta not being set to 0.0 means that the algorithm uses the Barnes-Hut acceleration, which indeed makes some recursive calls during the gradient computation.

I see!

Also I'd like to ask the following question: why is the num_traits::Float trait necessary? Why would you need f64s?

Mainly for ergonomic reasons, so that people can choose the precision of their floating-point numbers without needing to cast.
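
Concretely, the idea would be a signature along these lines (a sketch only; the crate's real entry point has many more parameters):

    use num_traits::Float;

    // Illustration only: with a generic bound the caller picks f32 or f64 and the
    // whole computation stays in that precision, no casting required.
    fn run_generic<T: Float>(data: &[T], perplexity: T, theta: T) -> Vec<T> {
        assert!(perplexity >= T::one(), "perplexity must be >= 1");
        assert!(theta >= T::zero(), "theta must be non-negative");
        // ... run t-SNE entirely in terms of T ...
        data.to_vec()
    }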

frjnn linked a pull request Mar 21, 2021 that will close this issue
frjnn closed this as completed in #5 Mar 21, 2021