Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMDS-2944] Federated Sparsity Propagation #1863

Closed
wants to merge 5 commits into from

Conversation

ywcb00
Copy link
Contributor

@ywcb00 ywcb00 commented Jul 15, 2023

Hi,
This PR adds the functionality of propagating the sparsity from the federated workers to the coordinator when executing an instruction without subsequent data consolidation at the worker. To achieve this, the workers respond to a EXEC_INST request with the number of non-zeros of their respective federated partition. Thereby, we can obtain and aggregate the number of non-zeros and subsequently set the number of non-zeros of the (still federated) matrix object.
In this PR, several instructions have been extended to support the before mentioned sparsity propagation. I also added a respective unit test which compares the propagated number of non-zeros to a reference. Unfortunately, I had to copy parts of the DMLScript.java code into the test class because the test needs direct access to the execution context of the script execution.
Additionally, I left three types of FIXME comments in the test dml script. In two of these comments, the underlying local operator (ifelse and cumsum) returns an incorrect number of non-zeros for some reason that I haven't discovered yet. The other FIXME comment relates to an incorrect computation of instructions like sum(X!=0) which is introduced by rewriting the constructed HOP.
NOTE: This PR is based on #1862 . Hence, we should wait with the merge until the other PR has been resolved.

Thanks for review :)

…ng the sparsity propagation from the federated worker to the coordinator

	NOTE: There are several TODOs and FIXMEs in the dml script
	NOTE: The java test class contains parts of the DMLScript class that is copied

chore(AutomatedTestBase.java): add methods to read a scalar and a meta data file from the expected/reference test directory
…Type.FULL if it results in a single federated partition

chore(FederationUtils.java): set the nnz only once to the total number of non-zeros when binding the federated responses

chore(MatrixLineagePair.java): add the method getNnz which calls the same method on the matrix object
…t matrix according to the response of the federated instruction request
… and the respective dml scripts

	double check the test marked with FIXME
…int like in the matrix object

chore(AggregateBinaryFEDInstruction.java): remove the cast to int of the blocksize
@ywcb00 ywcb00 force-pushed the feat/fed/io/sparsitypropagation branch from 8ba2489 to 3592f44 Compare July 16, 2023 07:55
Copy link
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I especially like the tests. They test many operations in one go violating the single element testing, but it is clear what is going on.

@Baunsgaard
Copy link
Contributor

Also, David,

It seems like you did not finish your signing up to Apache. since your user does not show you are part of the Apache group. Maybe you want to do that ?

image

image

@ywcb00
Copy link
Contributor Author

ywcb00 commented Jul 21, 2023

Thanks for the review @Baunsgaard :)
I also like that aspect of the tests. However, I don't like the copied code parts from DMLScript. Any idea how to prevent this code duplication? Or do you think that it's fine like this?

@Baunsgaard
Copy link
Contributor

Baunsgaard commented Jul 22, 2023

Thanks for the review @Baunsgaard :) I also like that aspect of the tests. However, I don't like the copied code parts from DMLScript. Any idea how to prevent this code duplication? Or do you think that it's fine like this?

If you want to avoid the duplicate maybe extending the class can help. (As in make another class in the test that extends the dmlscript class)
Unfortunately sometimes the easy way is to duplicate some code in the testing. And it did not seem excessive in this case.

@mboehm7
Copy link
Contributor

mboehm7 commented Jul 25, 2023

LGTM too - I'll merge it in and fix the remaining issues in the process.

@mboehm7 mboehm7 closed this in 25b7351 Jul 25, 2023
mboehm7 added a commit that referenced this pull request Jul 25, 2023
After some time debugging (the long chain of operations), it turned
out that the recent fix of hop-level sparsity propagation on already
solved all issues of #1863.

However, the test wrote meta data files with an incorrect number
of non-zeros which lead to incorrect nnz propagation in local
(which serves as the reference) and thus mismatches with federated
where the meta-data files are handled differently.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
3 participants