[SYSTEMDS-2972] Initial Multi-Threaded transformencode #1261
ilovemesomeramen wants to merge 11 commits into apache:master
Conversation
Added sparse support
# Conflicts:
#	src/main/java/org/apache/sysds/runtime/transform/encode/ColumnEncoderDummycode.java
#	src/test/resources/datasets/homes3/homes.tfspec_dummy_all.json
phaniarnab left a comment
Thanks for the patch @ilovemesomeramen. LGTM.
I have no comments that must be addressed before merging. However, I have a few comments and suggestions for future discussions and commits.
I fixed a few formatting issues before merging.
```java
	synchronized(sparseBlock.get(r)) {
		sparseBlock.set(r, c, v);
	}
}
else {
	denseBlock.set(r, c, v);
}
```
Is the denseBlock/sparseBlock guaranteed to be allocated here?
Why synchronize only the sparse case?
Yes, it must be allocated before.
Since dense blocks are just a collection of arrays, and nothing should be reading at the moment we write to the block, we can write concurrently without any synchronization. The sparse blocks, on the other hand, are only row-independent, so we need to synchronize over rows.
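The reasoning above can be illustrated with a small standalone sketch. This is a toy model, not the SystemDS classes: `denseBlock`, `sparseBlock`, `set`, and the one-task-per-column write pattern are hypothetical stand-ins chosen to mimic how independent column encoders write different columns of the same rows.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy model of the two write paths discussed above (all names hypothetical).
public class BlockWriteSketch {
	static double[][] denseBlock;                      // plain arrays: cell writes are independent
	static List<TreeMap<Integer, Double>> sparseBlock; // one sorted row each: inserts mutate shared row state

	static void set(int r, int c, double v, boolean sparse) {
		if(sparse) {
			// sparse rows are only independent of each other, so lock the row being modified
			TreeMap<Integer, Double> row = sparseBlock.get(r);
			synchronized(row) {
				row.put(c, v);
			}
		}
		else {
			// distinct (r,c) writers never touch the same array slot: no lock needed
			denseBlock[r][c] = v;
		}
	}

	// one task per column (as a column encoder would do); all tasks share the same rows,
	// so without the per-row lock the sparse writes would race
	public static long fill(int rows, int cols, boolean sparse) throws Exception {
		denseBlock = new double[rows][cols];
		sparseBlock = new ArrayList<>();
		for(int r = 0; r < rows; r++)
			sparseBlock.add(new TreeMap<>());
		ExecutorService pool = Executors.newFixedThreadPool(4);
		List<Callable<Object>> tasks = new ArrayList<>();
		for(int c = 0; c < cols; c++) {
			final int fc = c;
			tasks.add(() -> {
				for(int r = 0; r < rows; r++)
					set(r, fc, 1.0, sparse);
				return null;
			});
		}
		pool.invokeAll(tasks);
		pool.shutdown();
		// count the written cells to check that no write was lost
		long count = 0;
		if(sparse)
			for(TreeMap<Integer, Double> row : sparseBlock)
				count += row.size();
		else
			for(double[] row : denseBlock)
				for(double v : row)
					if(v != 0)
						count++;
		return count;
	}

	public static void main(String[] args) throws Exception {
		System.out.println(fill(8, 16, false)); // 128
		System.out.println(fill(8, 16, true));  // 128
	}
}
```

With the row lock, both paths deterministically end up with all rows*cols cells written; removing the `synchronized` block can silently drop concurrent inserts into the same row.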
```java
public void setApplyBlockSize(int blk) {
	APPLY_BLOCKSIZE = blk;
}
```
Can you please add a test with non-zero APPLY_BLOCKSIZE?
Yes, sure. I have it in my local test cases and forgot to add it.
```java
for(ColumnEncoderComposite encoder : _columnEncoders) {
	List<Callable<Object>> partialBuildTasks = encoder.getPartialBuildTasks(in, blockSize);
	if(partialBuildTasks == null) {
		partials.add(null);
		continue;
	}
	partials.add(pool.invokeAll(partialBuildTasks));
}
for(int e = 0; e < _columnEncoders.size(); e++) {
	List<Future<Object>> partial = partials.get(e);
	if(partial == null)
		continue;
	tasks.add(new ColumnMergeBuildPartialTask(_columnEncoders.get(e), partial));
}
```
Discussion: This logic of creating tasks (column-wise row partition) restricts us from more sophisticated task creation with an arbitrary number of columns. This may not be a problem though.
Since this PR, I have done a lot more testing, and this partial building is rather complicated, especially since a ton of intermediates are created, increasing GC pressure. At the moment, partial building is not really viable in most scenarios. This will be good to discuss on Friday.
```java
int blockSize = BUILD_BLOCKSIZE <= 0 ? in.getNumRows() : BUILD_BLOCKSIZE;
List<Callable<Integer>> tasks = new ArrayList<>();
```
Discussion: What if we need column-specific block sizes in the future?
That's a good point.
This should not be a problem: since the encoders are independent, we can just call each encoder with the block size we need.
So in the future we might get an array of block sizes, which we then just need to match to the encoders.
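A minimal sketch of that idea, assuming a hypothetical `resolveBlockSizes` helper: each encoder gets its own entry in a block-size array, and a non-positive entry keeps today's behavior of one block spanning all rows (mirroring the existing `BUILD_BLOCKSIZE <= 0 ? in.getNumRows() : BUILD_BLOCKSIZE` logic).

```java
import java.util.*;

// Hypothetical sketch: per-encoder block sizes instead of one global BUILD_BLOCKSIZE.
public class BlockSizeSketch {
	// blockSizes[e] <= 0 keeps the current default: one block spanning all rows
	public static int[] resolveBlockSizes(int[] blockSizes, int numRows) {
		int[] resolved = new int[blockSizes.length];
		for(int e = 0; e < blockSizes.length; e++)
			resolved[e] = blockSizes[e] <= 0 ? numRows : blockSizes[e];
		return resolved;
	}

	public static void main(String[] args) {
		// encoder 0 keeps the single-block default, encoder 1 builds in 64-row blocks
		System.out.println(Arrays.toString(resolveBlockSizes(new int[] {0, 64}, 1000))); // [1000, 64]
	}
}
```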
```java
_encoder.mergeBuildPartial(_partials, 0, _partials.size());
return 1;
```
What is the significance of this hard-coded 1?
I missed that. It should be `null`, and the task should be a `Callable<Object>`.
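A self-contained sketch of the corrected task, assuming that fix. `MergeTarget` is a hypothetical stand-in for the encoder's `mergeBuildPartial`; the point is only the task shape: a `Callable<Object>` that returns `null` because the result value carries no meaning.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the corrected merge task (MergeTarget is a hypothetical stand-in).
public class MergePartialSketch {
	// stands in for the encoder's mergeBuildPartial(partials, start, end)
	static class MergeTarget {
		final Map<String, Integer> dict = new HashMap<>();
		void mergeBuildPartial(List<Map<String, Integer>> partials, int start, int end) {
			for(int i = start; i < end; i++)
				for(Map.Entry<String, Integer> e : partials.get(i).entrySet())
					dict.merge(e.getKey(), e.getValue(), Integer::sum);
		}
	}

	static class ColumnMergeBuildPartialTask implements Callable<Object> {
		private final MergeTarget _encoder;
		private final List<Map<String, Integer>> _partials;

		ColumnMergeBuildPartialTask(MergeTarget encoder, List<Map<String, Integer>> partials) {
			_encoder = encoder;
			_partials = partials;
		}

		@Override
		public Object call() {
			_encoder.mergeBuildPartial(_partials, 0, _partials.size());
			return null; // no meaningful result: the task only signals completion
		}
	}

	public static int run() throws Exception {
		MergeTarget enc = new MergeTarget();
		List<Map<String, Integer>> partials = new ArrayList<>();
		partials.add(Map.of("a", 2));
		partials.add(Map.of("a", 1, "b", 3));
		Object res = new ColumnMergeBuildPartialTask(enc, partials).call();
		assert res == null;
		return enc.dict.get("a"); // merged count for key "a": 2 + 1 = 3
	}

	public static void main(String[] args) throws Exception {
		System.out.println(run()); // 3
	}
}
```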
This PR adds basic Multithreading capability to the transform encode implementation.
Each ColumnEncoder can be executed on a separate thread, or can be split up into even smaller subjobs which only apply to a certain row range.
Initial benchmarks with 16 CPUs show up to a 50x speed improvement compared to the old SystemML implementation.
Currently this code is dormant, which means a call to `transformencode` in a DML script still uses the single-threaded implementation. This will be changed when further improvements and testing are complete.
Large matrices (e.g., 1000000x1000) are still not viable due to suspected thread starvation. This will be addressed in a future PR with some sort of access partitioning (Radix/Range).
This PR also brings back sparse support for large dummycoded matrices, which was accidentally removed in a prior PR.
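The row-range splitting mentioned above can be sketched as follows. This is an illustrative helper, not the PR's code: `rowBlocks` is a hypothetical name, and the non-positive-means-whole-input convention is borrowed from the `BUILD_BLOCKSIZE` logic shown earlier.

```java
import java.util.*;

// Hypothetical sketch: cut one encoder's work into [start, end) row ranges
// that independent threads can process as subjobs.
public class RowRangeSketch {
	public static List<int[]> rowBlocks(int numRows, int blockSize) {
		// non-positive block size: a single block spanning all rows
		int blk = blockSize <= 0 ? numRows : blockSize;
		List<int[]> ranges = new ArrayList<>();
		for(int start = 0; start < numRows; start += blk)
			ranges.add(new int[] {start, Math.min(start + blk, numRows)});
		return ranges;
	}

	public static void main(String[] args) {
		for(int[] r : rowBlocks(10, 4))
			System.out.println(r[0] + ".." + r[1]); // 0..4, 4..8, 8..10
	}
}
```

Each range would then back one `Callable` submitted to the pool, which is what makes per-encoder parallelism finer-grained than one thread per column.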