New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYSTEMDS-2944] Federated Sparsity Propagation #1863
Conversation
…ng the sparsity propagation from the federated worker to the coordinator NOTE: There are several TODOs and FIXMEs in the dml script NOTE: The java test class contains parts of the DMLScript class that is copied chore(AutomatedTestBase.java): add methods to read a scalar and a meta data file from the expected/reference test directory
…Type.FULL if it results in a single federated partition chore(FederationUtils.java): set the nnz only once to the total number of non-zeros when binding the federated responses chore(MatrixLineagePair.java): add the method getNnz which calls the same method on the matrix object
…t matrix according to the response of the federated instruction request
… and the respective dml scripts double check the test marked with FIXME
…int like in the matrix object chore(AggregateBinaryFEDInstruction.java): remove the cast to int of the blocksize
8ba2489
to
3592f44
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I especially like the tests. They test many operations in one go violating the single element testing, but it is clear what is going on.
Thanks for the review @Baunsgaard :) |
If you want to avoid the duplicate maybe extending the class can help. (As in make another class in the test that extends the dmlscript class) |
LGTM too - I'll merge it in and fix the remaining issues in the process. |
After some time debugging (the long chain of operations), it turned out that the recent fix of hop-level sparsity propagation on already solved all issues of #1863. However, the test wrote meta data files with an incorrect number of non-zeros which lead to incorrect nnz propagation in local (which serves as the reference) and thus mismatches with federated where the meta-data files are handled differently.
Hi,
This PR adds the functionality of propagating the sparsity from the federated workers to the coordinator when executing an instruction without subsequent data consolidation at the worker. To achieve this, the workers respond to a EXEC_INST request with the number of non-zeros of their respective federated partition. Thereby, we can obtain and aggregate the number of non-zeros and subsequently set the number of non-zeros of the (still federated) matrix object.
In this PR, several instructions have been extended to support the before mentioned sparsity propagation. I also added a respective unit test which compares the propagated number of non-zeros to a reference. Unfortunately, I had to copy parts of the DMLScript.java code into the test class because the test needs direct access to the execution context of the script execution.
Additionally, I left three types of FIXME comments in the test dml script. In two of these comments, the underlying local operator (ifelse and cumsum) returns an incorrect number of non-zeros for some reason that I haven't discovered yet. The other FIXME comment relates to an incorrect computation of instructions like sum(X!=0) which is introduced by rewriting the constructed HOP.
NOTE: This PR is based on #1862 . Hence, we should wait with the merge until the other PR has been resolved.
Thanks for review :)