
[#940] improvement: Optimize columnar shuffle integration #958

Merged
merged 6 commits into from
Jun 28, 2023

Conversation

summaryzb
Contributor

What changes were proposed in this pull request?

  1. Make it possible to extend Uniffle in Spark 3.
  2. Optimize the shuffle metrics when using columnar shuffle.

Why are the changes needed?

#940

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test

@summaryzb summaryzb changed the title [#940] improvement: make the shuffle [#940] improvement: Optimize columnar shuffle integration Jun 20, 2023
-    shuffleWriteMetrics.incRecordsWritten(1L);
+    // records is a row based semantic
+    if (isRowBased) {
+      shuffleWriteMetrics.incRecordsWritten(1L);
Contributor Author

replace with this

Contributor

Could we add more comments to explain why we need this?

Contributor

Is it a common case? I think it is bound to the implementation of Gluten. Could we have a more common interface?

Contributor Author

Comments are added. Yes, it's a common case.
All columnar shuffles use their own serializer; all serializer-related work that is bound to the implementation of the columnar framework (not limited to Gluten) should be handled in the RSS columnar shuffle writer.

Contributor

Could we move this code to the addRecord method? Columnar shuffle won't call the addRecord method, will it?

Contributor Author

That's a good idea
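The behavior agreed on in this thread, counting one record per addRecord call only when the shuffle is row-based, can be sketched in isolation. Everything below is a simplified stand-in, not Uniffle's actual WriteBufferManager or Spark's ShuffleWriteMetrics:

```java
// Minimal sketch: count one record per addRecord call only for
// row-based shuffle. For columnar shuffle, one iterator element is a
// batch containing many rows, so the columnar writer must report the
// row count itself. Class and member names here are stand-ins.
public class RecordMetricSketch {
    static class WriteMetrics {
        long recordsWritten = 0;
        void incRecordsWritten(long n) { recordsWritten += n; }
    }

    final boolean isRowBased;
    final WriteMetrics metrics = new WriteMetrics();

    RecordMetricSketch(boolean isRowBased) { this.isRowBased = isRowBased; }

    void addRecord() {
        // "records" is a row-based semantic
        if (isRowBased) {
            metrics.incRecordsWritten(1L);
        }
    }
}
```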

@@ -141,7 +142,8 @@ public WriteBufferManager(
     this.requireMemoryRetryMax = bufferManagerOptions.getRequireMemoryRetryMax();
     this.arrayOutputStream = new WrappedByteArrayOutputStream(serializerBufferSize);
-    if (serializer != null) {
+    // in columnar shuffle, the serializer here is never used
+    this.isRowBased = (serializer != null);
Contributor

Is it a common case? If we support another columnar shuffle like RAPIDS, it may use a non-null serializer.


Can columnar shuffle be supported by adding a configuration or a new constructor? This judgment seems a bit hard-coded.

Contributor

Good point

Contributor Author

In my opinion, all columnar shuffles should pass a null serializer to WriteBufferManager.

  1. WriteBufferManager should only do buffer-related work, as its name implies.
  2. Columnar data frameworks vary; the integration work should be done in the implementation of the RSS columnar shuffle writer.
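The division of labor argued for here can be sketched as follows. All names are hypothetical: the point is only that serialization lives in the columnar writer, so the buffer manager handles bytes and never touches a serializer:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Sketch of the proposed split (class and method names are made up):
// the buffer manager only buffers bytes; the columnar writer owns the
// framework-specific serialization (Gluten, RAPIDS, ...).
public class ColumnarWriterSketch {
    static class BufferManagerSketch {
        final ByteArrayOutputStream buffered = new ByteArrayOutputStream();

        // buffer-related work only, as the class name implies
        void addPartitionData(int partitionId, byte[] data) {
            buffered.write(data, 0, data.length);
        }

        int bufferedBytes() { return buffered.size(); }
    }

    // stand-in for a real columnar serializer living in the writer
    static byte[] serializeBatch(List<String> batchRows) {
        return String.join("\n", batchRows).getBytes();
    }

    public static int writeBatch(BufferManagerSketch bm, int partitionId, List<String> rows) {
        byte[] bytes = serializeBatch(rows);
        bm.addPartitionData(partitionId, bytes);
        return bytes.length;
    }
}
```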

Contributor
@jerqi Jun 25, 2023

We can pass a non-null serializer and a ColumnarBatch to the addRecord method even though we use columnar shuffle. It's not equivalent.

Contributor

> We can pass a non-null serializer and a ColumnarBatch to the addRecord method even though we use columnar shuffle. It's not equivalent.

Not only serialization but also partitioning has to be handled, and both are tied to the implementation of the third-party columnar framework; we'd better handle them outside of WriteBufferManager.

For a shuffle, we can handle partitioning with the partitioner. You mean that the partitioner is null for Gluten, don't you?
In other words, we may use a null serializer for a row-based shuffle.

Contributor Author

@jerqi In columnar shuffle, an element of the iterator is a ColumnarBatch; it consists of many rows that will be partitioned to different partition IDs, and the current Spark partitioner API cannot handle this scenario.
Actually the Java partitioner is a fake when using columnar shuffle in Gluten; the partitioner is implemented in the C++ layer.
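The mismatch described here, one iterator element fanning out to many partitions, can be illustrated with a toy batch. The hash-style partitioning below is made up for illustration; in Gluten this actually happens in the C++ layer:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: one iterator element (a batch of row keys) lands
// in several partitions, so a single Partitioner.getPartition(key)
// call per element cannot describe where the data goes.
public class BatchPartitionSketch {
    public static Map<Integer, Integer> rowsPerPartition(List<Integer> batchKeys, int numPartitions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int key : batchKeys) {
            int partitionId = Math.floorMod(key, numPartitions); // hash-style partitioning
            counts.merge(partitionId, 1, Integer::sum);
        }
        return counts;
    }
}
```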

Contributor
@jerqi Jun 26, 2023

Could we have a null serializer for row-based shuffle in the future? I think the answer is yes, because some Spark data doesn't need serialization even though it is organized in row format.

Contributor Author

How about indicating row-based shuffle with a flag in RssConf? @jerqi

Contributor

Ok for me.
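The agreed approach, signaling row-based shuffle through the config rather than a null serializer, might look roughly like this. The key name `rss.row.based` and the plain Map-backed conf are assumptions, not Uniffle's actual RssConf API:

```java
import java.util.Map;

// Sketch only: a config-driven flag replacing the serializer null
// check. The key name and the Map-backed "conf" are assumptions.
// Defaulting to true means ordinary row-based jobs need no change.
public class RowBasedConfSketch {
    static final String ROW_BASED_KEY = "rss.row.based"; // hypothetical key
    static final boolean ROW_BASED_DEFAULT = true;

    public static boolean isRowBased(Map<String, String> conf) {
        return Boolean.parseBoolean(
            conf.getOrDefault(ROW_BASED_KEY, String.valueOf(ROW_BASED_DEFAULT)));
    }
}
```

A columnar integration such as Gluten would then set the flag to false in its Spark conf instead of relying on passing a null serializer.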


@xianjingfeng
Member

Should we make all fields as protected variables?

private final Map<Integer, Integer> shuffleIdToNumMapTasks = Maps.newConcurrentMap();
private ShuffleManagerGrpcService service;
private GrpcServer shuffleManagerServer;
protected final String clientType;
Contributor
@LuciferYang Jun 21, 2023

Why change all fields to protected? If this is necessary, some code comments also need to be added to explain it.

Contributor Author

Followed the suggestion.

@summaryzb
Contributor Author

summaryzb commented Jun 24, 2023

> Should we make all fields as protected variables?

No, fixed this. @xianjingfeng PTAL

@codecov-commenter

codecov-commenter commented Jun 26, 2023

Codecov Report

Merging #958 (92a9fc8) into master (8a0ae4b) will decrease coverage by 0.83%.
The diff coverage is 91.66%.

@@             Coverage Diff              @@
##             master     #958      +/-   ##
============================================
- Coverage     55.00%   54.18%   -0.83%     
- Complexity     2466     2474       +8     
============================================
  Files           367      355      -12     
  Lines         19237    17996    -1241     
  Branches       1579     1726     +147     
============================================
- Hits          10582     9751     -831     
+ Misses         8008     7643     -365     
+ Partials        647      602      -45     
Impacted Files Coverage Δ
...pache/spark/shuffle/writer/WriteBufferManager.java 79.67% <85.71%> (-0.11%) ⬇️
.../java/org/apache/spark/shuffle/RssSparkConfig.java 98.18% <100.00%> (+0.05%) ⬆️

... and 43 files with indirect coverage changes


Contributor
@advancedxy left a comment

Another thing about this integration: do we have any integration tests for it, so that we can catch any incompatibility before releasing?

@@ -141,7 +142,8 @@ public WriteBufferManager(
     this.requireMemoryRetryMax = bufferManagerOptions.getRequireMemoryRetryMax();
     this.arrayOutputStream = new WrappedByteArrayOutputStream(serializerBufferSize);
-    if (serializer != null) {
+    // in columnar shuffle, the serializer here is never used
+    this.isRowBased = (serializer != null);
Contributor

> Can columnar shuffle be supported by adding a configuration or a new constructor? This judgment seems a bit hard-coded.

+1. It seems a bit counterintuitive to rely on serializer == null to check whether the write buffer manager should support columnar shuffle or not.

Is it possible for Gluten to pass additional conf items to the Spark conf? Then on the Uniffle side we can add a columnarSupport field in the BufferManagerOptions class.
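The suggestion above can be sketched as follows; the field, constructor, and class shape are hypothetical, not the actual BufferManagerOptions API:

```java
// Sketch of advancedxy's suggestion (field and constructor are
// hypothetical): the columnar flag travels in the options object, so
// the buffer manager no longer infers it from serializer == null.
public class BufferManagerOptionsSketch {
    private final int serializerBufferSize;
    private final boolean columnarSupport;

    public BufferManagerOptionsSketch(int serializerBufferSize, boolean columnarSupport) {
        this.serializerBufferSize = serializerBufferSize;
        this.columnarSupport = columnarSupport;
    }

    public boolean isColumnarSupport() { return columnarSupport; }

    public int getSerializerBufferSize() { return serializerBufferSize; }
}
```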

Comment on lines 119 to 122
protected AtomicReference<String> id = new AtomicReference<>();
protected SparkConf sparkConf;
protected ShuffleWriteClient shuffleWriteClient;
protected DataPusher dataPusher;
Contributor

Could you point to the Gluten impl again?

It doesn't feel right to just expose these fields here, especially the id and sparkConf fields; they are not exposed in Spark's original shuffle manager.

I think these fields should be accessed at least via a getter method, so Uniffle is free to change its implementation later.
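The getter-based access asked for here might look like this minimal sketch; the class and method names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of getter-based access instead of a protected field (names
// are hypothetical): subclasses read state through a method, so the
// internal representation can change later without breaking
// extensions such as a Gluten integration.
public class ShuffleManagerBaseSketch {
    private final AtomicReference<String> id = new AtomicReference<>();

    // extension point: read-only access for subclasses
    protected String getAppId() { return id.get(); }

    void setAppId(String appId) { id.set(appId); }
}
```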

Contributor Author

You're right, id and dataPusher need to be extracted to a method,
but sparkConf and shuffleWriteClient are just used in some constructors.

@summaryzb
Contributor Author

> Another thing about this integration: do we have any integration tests for it, so that we can catch any incompatibility before releasing?

Since Uniffle is a dependency of Gluten, the integration tests should live on the Gluten side; that will be done after the release. For example, we wouldn't test Spark integration in the hadoop-common project; actually we fix Hadoop incompatibilities on the Spark side.
Currently I run TPC-DS in our production env to test the integration.

// that is handled by rss shuffle writer implementation
if (isRowBased) {
  shuffleWriteMetrics.incRecordsWritten(1L);
}
Contributor
@jerqi Jun 26, 2023

How about

List<ShuffleBlockInfo> dataPartition = addPartitionData(partitionId, serializedData, serializedDataLength, start);
if (isRowBased) {
    shuffleWriteMetrics.incRecordsWritten(1L);
}
return dataPartition;

Contributor Author

That's better

@@ -127,7 +127,7 @@ public RssShuffleWriter(
     ShuffleWriteClient shuffleWriteClient,
     RssShuffleHandle<K, V, C> rssHandle,
     Function<String, Boolean> taskFailureCallback) {
-    LOG.warn("RssShuffle start write taskAttemptId data" + taskAttemptId);
+    LOG.warn("RssShuffleaskAttempt start write taskAttemptId data" + taskAttemptId);
Contributor

There is a misspelling.

@jerqi jerqi requested a review from loukey-lj June 27, 2023 06:55
Contributor
@jerqi left a comment

LGTM, let @loukey-lj take another look.

@summaryzb
Contributor Author

gentle ping @xianjingfeng PTAL

Member
@xianjingfeng left a comment

LGTM

@xianjingfeng xianjingfeng merged commit 0a42cfb into apache:master Jun 28, 2023
27 checks passed
@xianjingfeng
Member

Merged. Thanks all.
