
[SYSTEMDS-2756] Scale and PCA builtin update #1123

Closed
wants to merge 1 commit

Conversation

Baunsgaard (Contributor)

No description provided.

mboehm7 (Contributor) commented Dec 12, 2020

Thanks @Baunsgaard for eliminating the unnecessary colMeans in the case of center and scale. However, please refrain from unnecessary changes to APIs and external behavior. I'll revert to the original script and simplify the scale case accordingly. As minor notes: the tests were failing because we only support right-hand-side broadcasting; all inputs of an ifelse are computed within a single DAG, which led to unnecessary computation; the scale vector replaced NaN and not zeros; and please avoid empty lines before closing braces.
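
For context, here is a minimal DML sketch (illustrative, not the actual scale builtin) of a center-and-scale path that computes the column means only once and keeps the branching in control flow rather than in an ifelse, so no redundant DAG inputs are materialized; all names are assumptions:

# sketch of center-and-scale; computes colMeans once and reuses it
scale_sketch = function(Matrix[Double] X, Boolean center, Boolean scale)
  return (Matrix[Double] Y)
{
  N = nrow(X);
  cm = colMeans(X);                # computed once, reused below
  if (center)
    X = X - cm;                    # right-hand-side broadcast
  if (scale) {
    if (center)                    # means are zero after centering
      cvars = colSums(X^2) / (N - 1);
    else
      cvars = (colSums(X^2) - N * (cm^2)) / (N - 1);
    sf = sqrt(cvars);
    # replace zeros (constant columns) before dividing, to avoid NaN
    sf = replace(target=sf, pattern=0, replacement=1e-16);
    X = X / sf;                    # right-hand-side broadcast
  }
  Y = X;
}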

@asfgit closed this in d439f0e on Dec 12, 2020
Baunsgaard (Contributor, Author)

  1. I changed the replace NaN because the NaN would be introduced in cases of division by zero; therefore it made sense to apply the replacement on the scale factor. This would of course not remove already existing NaN values in the matrix, but I would say it is fair to do such a cleanup before calling PCA, even if it changes the external behavior.
  2. The changes in the API allowed us to build a PCA predict function, such that you can "train" a PCA using the methods already provided, actually reconstruct the approximation of the original data using the extra returned parameters, and apply PCA to unseen data (see the sketch after this list). If the colMeans and scale are not returned, this is impossible. Furthermore, those return values can simply be ignored by the end user with no impact on performance.
  3. I wanted a measure of how much a PCA was affected by lossy compression, since this would give a fair measure of lost data; therefore I needed the extra outputs in the API.
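
To make point 2 concrete, here is a minimal DML sketch of what such a transform on unseen data could look like; pca_transform_sketch, Centering, and ScaleFactor are illustrative names and assumptions, not the actual API of the builtin or of the predict PR:

# hypothetical: apply a "trained" PCA to unseen data, assuming the
# training returned the principal components PC, the column means
# Centering, and the scale factors ScaleFactor
pca_transform_sketch = function(Matrix[Double] Xnew, Matrix[Double] PC,
    Matrix[Double] Centering, Matrix[Double] ScaleFactor)
  return (Matrix[Double] Xout)
{
  Xs = (Xnew - Centering) / ScaleFactor;  # same preprocessing as training
  Xout = Xs %*% PC;                       # project onto the components
}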

mboehm7 (Contributor) commented Dec 12, 2020

ad 1) besides the changed overall behavior, the comment referred to replace(target=ScaleFactor, pattern=NaN, replacement=1e-16), which would need to replace zero, as it happens before the division.

ad 2/3) for local experiments this is fine, but for the hierarchy of builtin functions, we should aim for simple and stable APIs and find other ways to automatically remove redundancy (e.g., by lineage-based reuse). Passing all these intermediates around would quickly become really messy and confusing for both users and developers. Which PCA predict do you refer to - is there already a builtin function for it? Generally, please don't rewrite the builtin functions just to make them more amenable for compressed operations.

Baunsgaard (Contributor, Author)

> ad 1) besides the changed overall behavior, the comment referred to replace(target=ScaleFactor, pattern=NaN, replacement=1e-16), which would need to replace zero, as it happens before the division.

Yes, and that is exactly why I replace all the 0 values: to avoid introducing NaN in the first place, instead of removing the NaN from the full matrix after the division operation.
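
A small standalone DML example (assumed, not taken from the PR) of the difference:

# column 1 is constant, so its scale factor is 0 and its centered
# values are 0; without the replacement, 0/0 would yield NaN
X  = matrix("1 2 1 4 1 6", rows=3, cols=2);
cm = colMeans(X);
sf = sqrt((colSums(X^2) - nrow(X) * (cm^2)) / (nrow(X) - 1));  # [0, 2]
sf = replace(target=sf, pattern=0, replacement=1e-16);  # before the division
Xs = (X - cm) / sf;  # column 1 becomes all zeros instead of NaN
print(toString(Xs));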

> ad 2/3) for local experiments this is fine, but for the hierarchy of builtin functions, we should aim for simple and stable APIs and find other ways to automatically remove redundancy (e.g., by lineage-based reuse).

I agree that simplicity is of utmost importance, but with the current PCA we are limited to applying only the standard version without scaling if we want to be able to reuse the model on unseen data. This limits the system overall. The redundancy of the extra return values should already be handled by the system.

> Passing all these intermediates around would quickly become really messy and confusing for both users and developers. Which PCA predict do you refer to - is there already a builtin function for it?

I agree that it becomes messy, and it has been for a while, especially if you consider the neural network part of the system, where each layer's weights and biases are passed back. But these return values are necessary for inference on unseen data. I have added a PR with the PCA predict: #1124. Maybe it would be nice to keep all the methods for an algorithm in the same file, such that the predict is located alongside the training code?

> Generally, please don't rewrite the builtin functions just to make them more amenable for compressed operations.

I would argue that these changes are for the "greater good" and only incidentally better for compression; I am avoiding changing things just for compression's sake, but thanks for the reminder 😄.

mboehm7 (Contributor) commented Dec 12, 2020

ad 1) there is a mismatch between what you wanted to do and what your code actually did; the comment just pointed that out. The PR did this:

replace(target=ScaleFactor, pattern=NaN, replacement=1e-16);

but you wanted to do this:

replace(target=ScaleFactor, pattern=0, replacement=1e-16);

Baunsgaard (Contributor, Author)

> but you wanted to do this:

Oops, logic fine ... execution wrong. Great catch, thanks!
