[SYSTEMDS-2550] Balancing and data partitioning #1131

Closed
tobiasrieger wants to merge 16 commits into apache:master from tobiasrieger:balancing_and_data_partitioning

@tobiasrieger
Contributor

This PR includes the closed PR #1113 and all changes proposed in its comments. It was rebased on master and consolidated to make it easier to merge.

Changes list:

  • Added four new federated data partitioning schemes
    • ShuffleFederatedScheme
    • SubsampleToMinFederatedScheme
    • BalanceToAvgFederatedScheme
    • ReplicateToMaxFederatedScheme
  • Added a runtime balancing parameter to the parameter server (optional in the federated case, with a default value)
  • Added different runtime balancing schemes
    • RUN_MIN
    • CYCLE_AVG
    • CYCLE_MAX
  • Simplified the federated control thread to accommodate the coming N-batch frequency
    • There is now only one UDF, which computes a given number of batches. If more batches are requested than the local epoch contains, it cycles through the local data again. This one function is sufficient for batch, N-batch, and epoch frequencies.
  • Provided a convenient way to create federated test matrices in the Automated Test Base
  • Renamed getFedMapping() of FederationMap to getFRangeFDataMap() to avoid confusion with the MatrixObject method of the same name.
  • Updated and improved CNN.dml and TwoNN.dml scripts
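
As a rough illustration of the runtime balancing and the single cycling UDF described in the list above, here is a minimal Python sketch. It is not the SystemDS implementation (which is Java). The function names, the rounding in CYCLE_AVG, and the reading of RUN_MIN / CYCLE_AVG / CYCLE_MAX as "use the minimum / average / maximum batch count across workers" are assumptions inferred from the scheme names.

```python
import math

def batches_per_epoch(num_rows, batch_size):
    # Batches needed to cover the local data once; the last batch may be short.
    return math.ceil(num_rows / batch_size)

def balanced_batch_count(scheme, worker_rows, batch_size):
    # Hypothetical mapping from a runtime balancing name to the number of
    # batches every worker computes per epoch (assumption, see lead-in).
    counts = [batches_per_epoch(n, batch_size) for n in worker_rows]
    if scheme == "RUN_MIN":
        return min(counts)
    if scheme == "CYCLE_AVG":
        return math.ceil(sum(counts) / len(counts))  # rounding up is an assumption
    if scheme == "CYCLE_MAX":
        return max(counts)
    raise ValueError("unknown runtime balancing scheme: " + scheme)

def batch_ranges(num_rows, batch_size, num_batches, start_batch=0):
    # Row ranges [begin, end) for num_batches consecutive batches. When more
    # batches are requested than the local epoch holds, the index wraps around
    # and the worker starts over at its first batch.
    per_epoch = batches_per_epoch(num_rows, batch_size)
    ranges = []
    for i in range(start_batch, start_batch + num_batches):
        local = i % per_epoch
        begin = local * batch_size
        end = min(begin + batch_size, num_rows)
        ranges.append((begin, end))
    return ranges
```

Under these assumptions, a worker holding 100 rows that is asked for 5 batches of size 50 would process its two local batches, wrap around, and repeat them until 5 batches are done. Calling the same function with `num_batches` set to 1, N, or the balanced per-epoch count would cover the batch, N-batch, and epoch frequencies.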

@mboehm7
Contributor

mboehm7 commented Dec 20, 2020

LGTM - thanks for the extensions @tobiasrieger. This patch is a good start for experiments on these balancing and data partitioning techniques.

During the merge, I only made a few tweaks here and there: added serial IDs to serializable classes, renamed getFRangeFDataMap to getMap, capitalized all class names, replaced the prints of non-implemented errors with NotImplementedExceptions, replaced the progress printing with log info messages (so they can be disabled if needed), reconfigured the matrix multiplies for shuffling, replication, and sampling to the available vcores (instead of 1), changed the wait in the federated paramserv test to the global fed-worker wait configuration, and made some minor formatting changes.
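
For readers unfamiliar with why shuffling, replication, and sampling involve matrix multiplies at all, the following Python sketch shows the general idea: multiplying by a 0/1 selection matrix picks, reorders, or repeats rows. This is only an illustration under assumptions. The helper names are hypothetical, the cyclic replication order is one possible choice, and the actual SystemDS code may build and apply these matrices differently.

```python
import random

def selection_matrix(row_indices, num_rows):
    # 0/1 matrix P with P[i][row_indices[i]] = 1, so that P * X copies
    # row row_indices[i] of X into output row i.
    return [[1 if j == src else 0 for j in range(num_rows)] for src in row_indices]

def matmul(a, b):
    # Plain dense matrix multiply on lists of lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def shuffle_rows(x, seed):
    # Shuffling: the selection matrix is a permutation matrix.
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    return matmul(selection_matrix(idx, len(x)), x)

def replicate_rows(x, target_rows):
    # Replication: source rows repeat (cyclically here) until target_rows is reached.
    idx = [i % len(x) for i in range(target_rows)]
    return matmul(selection_matrix(idx, len(x)), x)

def subsample_rows(x, target_rows, seed):
    # Sampling: the selection matrix has fewer rows than x (no replacement).
    idx = sorted(random.Random(seed).sample(range(len(x)), target_rows))
    return matmul(selection_matrix(idx, len(x)), x)
```

Because each operation reduces to one matrix multiply, the degree of parallelism configured for that multiply directly affects how fast the data can be repartitioned.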

@asfgit asfgit closed this in 5dec562 Dec 20, 2020