
[SYSTEMDS-2756] Scale and PCA builtin update #1123

Closed
wants to merge 1 commit

Conversation

Baunsgaard (Contributor)

No description provided.

mboehm7 (Contributor) commented Dec 12, 2020

Thanks @Baunsgaard for eliminating the unnecessary colMeans in the case of center and scale. However, please refrain from unnecessary changes to APIs and external behavior. I'll revert to the original script and simplify the scale case accordingly. As minor notes: the tests were failing because we only support right-hand-side broadcasting; all inputs of an ifelse are computed within a single DAG, which led to unnecessary computation; the scale vector replaced NaN and not zeros; and please avoid empty lines before closing braces.
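
For context, here is a minimal DML sketch (illustrative, not the actual scale builtin) of a center-and-scale path that computes the column means only once and keeps the branching in control flow rather than in an ifelse, so no redundant DAG inputs are materialized; all names are assumptions:

# sketch of center-and-scale; computes colMeans once and reuses it
scale_sketch = function(Matrix[Double] X, Boolean center, Boolean scale)
  return (Matrix[Double] Y)
{
  N = nrow(X);
  cm = colMeans(X);                # computed once, reused below
  if (center)
    X = X - cm;                    # right-hand-side broadcast
  if (scale) {
    if (center)                    # means are zero after centering
      cvars = colSums(X^2) / (N - 1);
    else
      cvars = (colSums(X^2) - N * (cm^2)) / (N - 1);
    sf = sqrt(cvars);
    # replace zeros (constant columns) before dividing, to avoid NaN
    sf = replace(target=sf, pattern=0, replacement=1e-16);
    X = X / sf;                    # right-hand-side broadcast
  }
  Y = X;
}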

@asfgit closed this in d439f0e on Dec 12, 2020
Baunsgaard (Contributor, Author)

  1. I changed the replace NaN because the NaN would be introduced in cases of division by zero; therefore it made sense to apply the replacement on the scale factor. This would of course not remove already existing NaN values in the matrix, but I would say it is fair to do such a cleanup before calling PCA, even if it changes the external behavior.
  2. The changes in the API allowed us to build a PCA predict function, such that you can "train" a PCA using the methods already provided, actually reconstruct the approximation of the original data using the extra returned parameters, and apply PCA to unseen data (see the sketch after this list). If the colMeans and scale are not returned, this is impossible. Furthermore, those return values can simply be ignored by the end user with no impact on performance.
  3. I wanted a measure of how much a PCA was affected by lossy compression, since this would give a fair measure of lost data; therefore I needed the extra outputs in the API.
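
To make point 2 concrete, here is a minimal DML sketch of what such a transform on unseen data could look like; pca_transform_sketch, Centering, and ScaleFactor are illustrative names and assumptions, not the actual API of the builtin or of the predict PR:

# hypothetical: apply a "trained" PCA to unseen data, assuming the
# training returned the principal components PC, the column means
# Centering, and the scale factors ScaleFactor
pca_transform_sketch = function(Matrix[Double] Xnew, Matrix[Double] PC,
    Matrix[Double] Centering, Matrix[Double] ScaleFactor)
  return (Matrix[Double] Xout)
{
  Xs = (Xnew - Centering) / ScaleFactor;  # same preprocessing as training
  Xout = Xs %*% PC;                       # project onto the components
}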

mboehm7 (Contributor) commented Dec 12, 2020

ad 1) besides the changed overall behavior, the comment referred to replace(target=ScaleFactor, pattern=NaN, replacement=1e-16), which would need to replace zero, as it happens before the division.

ad 2/3) for local experiments this is fine, but for the hierarchy of builtin functions, we should aim for simple and stable APIs and find other ways to automatically remove redundancy (e.g., by lineage-based reuse). Passing all these intermediates around would quickly become really messy and confusing for both users and developers. Which PCA predict do you refer to - is there already a builtin function for it? Generally, please don't rewrite the builtin functions just to make them more amenable for compressed operations.

Baunsgaard (Contributor, Author)

> ad 1) besides the changed overall behavior, the comment referred to replace(target=ScaleFactor, pattern=NaN, replacement=1e-16), which would need to replace zero, as it happens before the division.

Yes, and that is exactly why I replace all the 0 values: to avoid introducing NaN in the first place, instead of removing the NaN from the full matrix after the division operation.
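
A small standalone DML example (assumed, not taken from the PR) of the difference:

# column 1 is constant, so its scale factor is 0 and its centered
# values are 0; without the replacement, 0/0 would yield NaN
X  = matrix("1 2 1 4 1 6", rows=3, cols=2);
cm = colMeans(X);
sf = sqrt((colSums(X^2) - nrow(X) * (cm^2)) / (nrow(X) - 1));  # [0, 2]
sf = replace(target=sf, pattern=0, replacement=1e-16);  # before the division
Xs = (X - cm) / sf;  # column 1 becomes all zeros instead of NaN
print(toString(Xs));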

> ad 2/3) for local experiments this is fine, but for the hierarchy of builtin functions, we should aim for simple and stable APIs and find other ways to automatically remove redundancy (e.g., by lineage-based reuse).

I agree that simplicity is of utmost importance, but with the current PCA we are limited to applying only the standard version without scaling if we want to be able to reuse the model on unseen data. This limits the system overall. The redundancy of the extra return values should already be handled by the system.

> Passing all these intermediates around would quickly become really messy and confusing for both users and developers. Which PCA predict do you refer to - is there already a builtin function for it?

I agree that it becomes messy, and it has been for a while, especially if you consider the neural network part of the system, where each layer's weights and biases are passed back. But these return values are necessary for inference on unseen data. I have added a PR with the PCA predict: #1124. Maybe it would be nice to keep all the methods for an algorithm in the same file, such that the predict is located alongside the training code?

> Generally, please don't rewrite the builtin functions just to make them more amenable for compressed operations.

I would argue that these changes are for the "greater good" and only incidentally better for compression; I am avoiding changing things just for compression's sake, but thanks for the reminder 😄.

mboehm7 (Contributor) commented Dec 12, 2020

ad 1) there is a mismatch between what you wanted to do and what your code actually did; the comment just pointed that out. The PR did this:

replace(target=ScaleFactor, pattern=NaN, replacement=1e-16);

but you wanted to do this:

replace(target=ScaleFactor, pattern=0, replacement=1e-16);

Baunsgaard (Contributor, Author)

> but you wanted to do this:

Oops, logic fine ... execution wrong. Great catch, thanks!
