Skip to content

[CELEBORN-914] Support memory file storage#2300

Closed
FMX wants to merge 40 commits intoapache:mainfrom
FMX:B914
Closed

[CELEBORN-914] Support memory file storage#2300
FMX wants to merge 40 commits intoapache:mainfrom
FMX:B914

Conversation

@FMX
Copy link
Contributor

@FMX FMX commented Feb 19, 2024

What changes were proposed in this pull request?

To support memory file storage.

Why are the changes needed?

To improve shuffle performance for small shuffle files.

Design doc: https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit?usp=sharing

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA and manually test on a cluster.

@FMX FMX marked this pull request as draft February 19, 2024 01:20
@FMX FMX force-pushed the B914 branch 2 times, most recently from fdfff8d to a0e672f Compare February 19, 2024 08:18
@codecov
Copy link

codecov bot commented Feb 19, 2024

Codecov Report

Attention: Patch coverage is 73.15436% with 40 lines in your changes are missing coverage. Please review.

Project coverage is 49.33%. Comparing base (21d5698) to head (52cd7d8).
Report is 8 commits behind head on main.

Files Patch % Lines
...java/org/apache/celeborn/common/meta/FileInfo.java 62.86% 10 Missing and 3 partials ⚠️
...rg/apache/celeborn/common/meta/MemoryFileInfo.java 52.95% 8 Missing ⚠️
...rg/apache/celeborn/common/meta/ReduceFileMeta.java 70.00% 6 Missing ⚠️
...he/celeborn/common/util/ShuffleBlockInfoUtils.java 83.34% 0 Missing and 6 partials ⚠️
...g/apache/celeborn/common/protocol/StorageInfo.java 50.00% 2 Missing ⚠️
...cala/org/apache/celeborn/common/CelebornConf.scala 88.24% 1 Missing and 1 partial ⚠️
...org/apache/celeborn/common/util/PbSerDeUtils.scala 0.00% 2 Missing ⚠️
...e/celeborn/common/network/buffer/ChunkBuffers.java 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2300      +/-   ##
==========================================
+ Coverage   48.96%   49.33%   +0.38%     
==========================================
  Files         209      211       +2     
  Lines       13102    13238     +136     
  Branches     1134     1149      +15     
==========================================
+ Hits         6414     6530     +116     
- Misses       6270     6276       +6     
- Partials      418      432      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@FMX FMX force-pushed the B914 branch 5 times, most recently from efdf6e3 to 6cc81ac Compare February 22, 2024 02:25
@FMX FMX marked this pull request as ready for review February 26, 2024 06:42
@FMX FMX force-pushed the B914 branch 2 times, most recently from a29e592 to 462f5c4 Compare February 29, 2024 08:58
}

public boolean addStream(long streamId) {
ReduceFileMeta reduceFileMeta = (ReduceFileMeta) fileMeta;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not use actual type ReduceFileMeta directly, Can we delegate stream operation to FileMeta or use instanceof?

return length;
}

public void addLength(int length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MemoryInfo.addLength seems the same like DiskFileInfo.updateBytesFlushed

import org.apache.celeborn.common.network.buffer.ManagedBuffer;

public interface ManagedBuffers {
int numChunks();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @OverRide to all subclasses

private final int numChunks;
private final CompositeByteBuf buffer;

public MemoryFileManagedBuffers(MemoryFileInfo memoryFileInfo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can create BaseChunk/ReduceManagedBuffers to eliminate duplicate codes

return indexMap;
}

public static void reorganizeBuffer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorganizeBuffer -> sortBufferByMapRange?

.categories("worker")
.doc("Max ratio of direct memory to store shuffle data")
.version("0.2.0")
.doc("Max ratio of direct memory to store shuffle data.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.4.1->0.5, what's the meaning about the default value

null)
}

if (location.getStorageInfo.localDiskAvailable() || location.getStorageInfo.HDFSAvailable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if -> else

<include>**/*Test*.*</include>
</includes>
<reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
<argLine>${argLine} -ea -Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=128m ${extraJavaTestArgs} -Dio.netty.tryReflectionSetAccessible=true</argLine>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change that in this pr

createFile(writerContext);

// Reduce partition data writers support memory storage now
if (supportInMemory && createFileResult._1() != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO better to delegate the actual writing and eviction to something like a TieredFileStorage. This would make the PartitionDataWriter more clear.

FMX added 8 commits March 20, 2024 15:26
# Conflicts:
#	worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala
# Conflicts:
#	client/src/main/java/org/apache/celeborn/client/read/CelebornInputStream.java
#	worker/src/main/scala/org/apache/celeborn/service/deploy/worker/FetchHandler.scala
@codecov-commenter
Copy link

codecov-commenter commented Apr 7, 2024

Codecov Report

Attention: Patch coverage is 9.49367% with 143 lines in your changes are missing coverage. Please review.

Project coverage is 38.98%. Comparing base (121395f) to head (53555a1).
Report is 51 commits behind head on main.

Current head 53555a1 differs from pull request most recent head 137ab52

Please upload reports for the commit 137ab52 to get more accurate results.

Files Patch % Lines
...he/celeborn/common/util/ShuffleBlockInfoUtils.java 0.00% 36 Missing ⚠️
...java/org/apache/celeborn/common/meta/FileInfo.java 0.00% 35 Missing ⚠️
...rg/apache/celeborn/common/meta/MemoryFileInfo.java 0.00% 24 Missing ⚠️
...rg/apache/celeborn/common/meta/ReduceFileMeta.java 0.00% 20 Missing ⚠️
...leborn/common/network/buffer/FileChunkBuffers.java 0.00% 6 Missing ⚠️
...born/common/network/buffer/MemoryChunkBuffers.java 0.00% 6 Missing ⚠️
...cala/org/apache/celeborn/common/CelebornConf.scala 68.43% 6 Missing ⚠️
...e/celeborn/common/network/buffer/ChunkBuffers.java 0.00% 4 Missing ⚠️
.../org/apache/celeborn/common/meta/DiskFileInfo.java 0.00% 2 Missing ⚠️
...g/apache/celeborn/common/protocol/StorageInfo.java 50.00% 2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2300      +/-   ##
==========================================
- Coverage   40.17%   38.98%   -1.19%     
==========================================
  Files         218      219       +1     
  Lines       13742    13547     -195     
  Branches     1214     1191      -23     
==========================================
- Hits         5520     5280     -240     
- Misses       7905     7966      +61     
+ Partials      317      301      -16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

FMX added 3 commits April 7, 2024 17:45
# Conflicts:
#	worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/local/DiskReducePartitionDataWriterSuiteJ.java
for (int i = startMapIndex; i < endMapIndex; i++) {
List<ShuffleBlockInfo> blockInfos = indexMap.get(i);
if (blockInfos != null) {
for (ShuffleBlockInfo blockInfo : blockInfos) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just calculate start and end, then slice once from sortedByteBuf. Check contiguous during calculation.

}
// trigger resume
if (memoryUsage < resumeThreshold) {
if (workerMemoryUsageRatio() < resumeRatio) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use memoryUsage / (double) (maxDirectMemory) here to avoid duplicate calculation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a gauge to export this usage ratio to Grafana.

MemoryManager.instance().incrementDiskBuffer(numBytes);
// read flush buffer to generate correct chunk offsets
// data header layout (mapId, attemptId, nextBatchId, length)
ByteBuffer headerBuf = ByteBuffer.allocate(16);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to check if flushBuffer exceeds chunkSize, if yes, do the following logic, else just use numBytes.

FMX added 4 commits May 22, 2024 20:56
# Conflicts:
#	assets/grafana/celeborn-dashboard.json
Copy link
Contributor

@waitinfuture waitinfuture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the effort! Merging to main(v0.5.0). Please add more UTs for this feature in followup PRs.

@FMX FMX closed this in 89d56c9 May 23, 2024
@s0nskar
Copy link
Contributor

s0nskar commented Aug 14, 2024

@FMX metrics added by this PR is not added to the Celeborn website Monitoring page. Also, should we start adding such changes in the release notes as well. WDYT?

FMX pushed a commit that referenced this pull request Aug 26, 2024
…nitoring.md

### What changes were proposed in this pull request?

Adding documentation for missing memory file storage metrics.

### Why are the changes needed?

Few new metrics were added in #2300 but they were missing their documentation in monitoring.md

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NA

Closes #2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
FMX pushed a commit that referenced this pull request Aug 26, 2024
…nitoring.md

Adding documentation for missing memory file storage metrics.

Few new metrics were added in #2300 but they were missing their documentation in monitoring.md

NO

NA

Closes #2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
(cherry picked from commit b7027b6)
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
zaynt4606 pushed a commit to zaynt4606/celeborn that referenced this pull request Aug 29, 2024
…nitoring.md

### What changes were proposed in this pull request?

Adding documentation for missing memory file storage metrics.

### Why are the changes needed?

Few new metrics were added in apache#2300 but they were missing their documentation in monitoring.md

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NA

Closes apache#2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
FMX pushed a commit that referenced this pull request Oct 8, 2024
…Ids of worker service log in startup document

### What changes were proposed in this pull request?

Add `emptyFilePrimaryIds` and `emptyFileReplicaIds` of worker service log in startup document.

### Why are the changes needed?

#2300 has added `emptyFilePrimaryIds` and `emptyFileReplicaIds` of startup log of for worker service in `Controller`, which should also add into the log of startup document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2774 from SteNicholas/CELEBORN-914.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
FMX pushed a commit that referenced this pull request Oct 8, 2024
…Ids of worker service log in startup document

### What changes were proposed in this pull request?

Add `emptyFilePrimaryIds` and `emptyFileReplicaIds` of worker service log in startup document.

### Why are the changes needed?

#2300 has added `emptyFilePrimaryIds` and `emptyFileReplicaIds` of startup log of for worker service in `Controller`, which should also add into the log of startup document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2774 from SteNicholas/CELEBORN-914.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
(cherry picked from commit c5ff12b)
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
turboFei pushed a commit to turboFei/incubator-celeborn that referenced this pull request Oct 10, 2024
…Ids of worker service log in startup document

### What changes were proposed in this pull request?

Add `emptyFilePrimaryIds` and `emptyFileReplicaIds` of worker service log in startup document.

### Why are the changes needed?

apache#2300 has added `emptyFilePrimaryIds` and `emptyFileReplicaIds` of startup log of for worker service in `Controller`, which should also add into the log of startup document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes apache#2774 from SteNicholas/CELEBORN-914.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
…nitoring.md

### What changes were proposed in this pull request?

Adding documentation for missing memory file storage metrics.

### Why are the changes needed?

Few new metrics were added in apache#2300 but they were missing their documentation in monitoring.md

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

NA

Closes apache#2705 from s0nskar/memory_metrics.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
HolyLow pushed a commit to HolyLow/celeborn that referenced this pull request Oct 14, 2024
…Ids of worker service log in startup document

### What changes were proposed in this pull request?

Add `emptyFilePrimaryIds` and `emptyFileReplicaIds` of worker service log in startup document.

### Why are the changes needed?

apache#2300 has added `emptyFilePrimaryIds` and `emptyFileReplicaIds` of startup log of for worker service in `Controller`, which should also add into the log of startup document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes apache#2774 from SteNicholas/CELEBORN-914.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants