Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy #210

Merged
merged 24 commits into from
Oct 9, 2022

Conversation

smallzhongfeng
Copy link
Contributor

What changes were proposed in this pull request?

The detection logic of abnormal paths is also added to the APP_BALANCE strategy.

Why are the changes needed?

Anomaly detection can be performed when selecting a remote path to avoid selecting a problematic path.

Does this PR introduce any user-facing change?

The parameters of the original IO_SAMPLE strategy are reused and the names are changed.

  1. rss.coordinator.remote.storage.select.strategy support APP_BALANCE and IO_SAMPLE, APP_BALANCE selection strategy based on the number of apps, IO_SAMPLE selection strategy based on time consumption of reading and writing files.
  2. rss.coordinator.remote.storage.schedule.time , if user choose IO_SAMPLE, file will be read and written at regular intervals.
  3. rss.coordinator.remote.storage.schedule.file.size , the size of each read / write HDFS file.
  4. rss.coordinator.remote.storage.schedule.access.times, number of times to read and write HDFS files.

How was this patch tested?

Passed uts.

@codecov-commenter
Copy link

codecov-commenter commented Sep 11, 2022

Codecov Report

Merging #210 (82d36c4) into master (f1cb43f) will decrease coverage by 0.15%.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master     #210      +/-   ##
============================================
- Coverage     59.31%   59.16%   -0.16%     
+ Complexity     1345     1339       -6     
============================================
  Files           161      163       +2     
  Lines          8780     8810      +30     
  Branches        828      833       +5     
============================================
+ Hits           5208     5212       +4     
- Misses         3304     3332      +28     
+ Partials        268      266       -2     
Impacted Files Coverage Δ
...fle/coordinator/AbstractSelectStorageStrategy.java 20.00% <20.00%> (ø)
...nator/LowestIOSampleCostSelectStorageStrategy.java 71.42% <69.35%> (-2.65%) ⬇️
...apache/uniffle/coordinator/ApplicationManager.java 83.80% <70.73%> (-6.47%) ⬇️
...e/coordinator/AppBalanceSelectStorageStrategy.java 76.00% <75.00%> (+3.65%) ⬆️
...a/org/apache/uniffle/common/RemoteStorageInfo.java 96.29% <100.00%> (+1.85%) ⬆️
...rg/apache/uniffle/coordinator/CoordinatorConf.java 97.05% <100.00%> (ø)
...pache/uniffle/server/ShuffleServerGrpcService.java 0.90% <0.00%> (-0.02%) ⬇️
...orage/handler/impl/LocalFileServerReadHandler.java 77.96% <0.00%> (ø)
... and 2 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

smallzhongfeng added 2 commits September 11, 2022 23:12

Map<String, RankValue> getRemoteStoragePathRankValue();
void setFs(FileSystem fs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would better not assume our remote storage is only FileSystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more, or give me some other examples ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support S3 , COS, Redis or some DBs, it's not suitable for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can remove this method, thinking that if I want to support redis or other databases, I need to re implement the sortPathByRankValue method. It's a huge job.

@jerqi
Copy link
Contributor

jerqi commented Sep 14, 2022

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

1 similar comment
@jerqi
Copy link
Contributor

jerqi commented Sep 14, 2022

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

@smallzhongfeng
Copy link
Contributor Author

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

No problem, I got it.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> sortPathByRankValue(String path, Path testPath, long time, boolean isHealthy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path seems to be a concept of filesystem. It may be not general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can support fileSystem at the beginning. If we need to support db, then there may be more things to consider, or we can add a parameter. If it is a file supported by fileSystem, we will turn on this switch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For S3 , COS, it's OK, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least We need consider the object Storage.

Copy link
Contributor Author

@smallzhongfeng smallzhongfeng Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

�IIRC, object storage such as s3 also has the concept of Path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But cos doesn't seem to have this concept.

Copy link
Member

@zuston zuston Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I know, the object storage also can be accessed by Hadoop Filesystem Interface (HCFS), like s3/oss/cos. So this is not a problem

But if we want to involve more storages which are not implemented by HCFS or speed up specified api which is only compatible with original storage api, maybe we should implement uniffle's own filesystem, it can refer to Flink.

Please let me know if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your reminder. Although I am not very impressed with this, I have used a similar method to read files in s3.

@smallzhongfeng
Copy link
Contributor Author

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

@jerqi
Copy link
Contributor

jerqi commented Sep 20, 2022

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

@smallzhongfeng
Copy link
Contributor Author

Could we support anomaly detection of hdfs files first?

@jerqi
Copy link
Contributor

jerqi commented Sep 21, 2022

Could we support anomaly detection of hdfs files first?

We should have more general interface. We can only implement part of them.

@zuston
Copy link
Member

zuston commented Sep 21, 2022

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

@jerqi
Copy link
Contributor

jerqi commented Sep 21, 2022

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

We shouldn't relay on the filesystem of object storage too much. Some operations of object storage is slow. For example, list operation, rename operation and append operation.

@smallzhongfeng
Copy link
Contributor Author

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

@jerqi
Copy link
Contributor

jerqi commented Sep 22, 2022

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

My doubt point is that the concept of path is not general one for the storage.

@smallzhongfeng
Copy link
Contributor Author

So I did not use the concept of path for the time being, but used a string to represent.

@smallzhongfeng
Copy link
Contributor Author

smallzhongfeng commented Sep 22, 2022

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

@jerqi
Copy link
Contributor

jerqi commented Sep 22, 2022

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

It's ok.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> readAndWrite(String path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a little doubt, why do the readAndWrite become the only interface method for SelectStorageStrategy? I can rank by many methods, it's not necessary to read and write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of the readAndWrite interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of the readAndWrite interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readAndWrite is not a good name. It's strange for the interface SelectStorageStrategy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a method pickStorage or selectStorage for this interface?

Copy link
Contributor Author

@smallzhongfeng smallzhongfeng Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that might make more sense for this interface, I will add.

@smallzhongfeng
Copy link
Contributor Author

PTAL @jerqi

import org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategy.RankValue;

public interface SelectStorageStrategy {

RemoteStorageInfo pickRemoteStorage(String appId);
List<Map.Entry<String, RankValue>> detectStorage(String path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path -> uri?
Path is not general concept for storage. Maybe we should use uri.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good suggestion.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
List<Map.Entry<String, RankValue>> pickStorage(List<Map.Entry<String, RankValue>> pathList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we give pathList a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any good advice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about uriRankings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uriList or uris may be better?
Why do the method pickStorage return a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified it to let pickStorage return to a selected path.

Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo();

Map<String, RankValue> getRemoteStoragePathRankValue();
String pickStorage(List<Map.Entry<String, RankValue>> uris, String appId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return RemoteStorageInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, because the return value of pickStorage only needs to be passed to incRemoteStorageCounter, incRemoteStorageCounter only needs String type, and String type is more general than RemoteStorageInfo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need, because the return value of pickStorage only needs to be passed to incRemoteStorageCounter, incRemoteStorageCounter only needs String type, and String type is more general than RemoteStorageInfo.

String is too general to describe what we want to return. RemoteStorageInfo is more accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, updated.

Path testPath = new Path(rssTest);
try {
FileSystem fs = HadoopFilesystemProvider.getFilesystem(remotePath, hdfsConf);
for (int j = 0; j < readAndWriteTimes; j++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just detect the health of storage, it seems unnecessary for us write and read them multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For appBalance, it can indeed be changed once, but for io sampling, the number of times can be increased, perhaps we can set the parameter to 1 by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For appBalance, it shouldn't be a configuration option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

private final Configuration hdfsConf;
private final int fileSize;
private final int readAndWriteTimes;
private boolean remotePathIsHealthy = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this variable remotePathIsHealthy mean?
This class is a strategy. I can't get the meaning of this varibale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is used to determine whether the path is normal, and to pass the parameter into the method sortPathByRankValue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is used to determine whether the path is normal, and to pass the parameter into the method sortPathByRankValue.

One strategy should have multiple paths, every path is healthy, the variable is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that there are many paths, but this attribute is applied to each path. The healthy path is processed by sortPathByRankValue according to remotePathIsHealthy as true, and corresponding logic is processed when it is false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can look at the logic of sortPathByRankValue. When we judge by this parameter, it is a normal path, and then deal with it accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the variable necessary to become member field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

uris = uris.stream().filter(rv -> rv.getValue().getReadAndWriteTime().get() != Long.MAX_VALUE).sorted(
Comparator.comparingInt(entry -> entry.getValue().getAppNum().get())).collect(Collectors.toList());
} else {
// If all paths are unhealthy, assign paths according to the number of apps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some logs and metrics when all paths are unhealthy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will add.

conf.getLong(CoordinatorConf.COORDINATOR_REMOTE_STORAGE_SCHEDULE_TIME), TimeUnit.MILLISECONDS);
}

public void checkReadAndWrite() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this logic into the class SelectStorageStrategy?

@jerqi
Copy link
Contributor

jerqi commented Sep 27, 2022

Maybe we should have a abstract class to define the common behaviors of SelectStorageStrategy.

@smallzhongfeng
Copy link
Contributor Author

smallzhongfeng commented Oct 8, 2022

This modification has done three things

  1. An abstract class AbstractSelectStorageStrategy is added to provide a method for reading and writing hdfs paths.
  2. Added logs of the sorted storage list.
  3. Removed a global variable of isHealthy and changed it to a local variable.

@smallzhongfeng

This comment was marked as resolved.

@smallzhongfeng
Copy link
Contributor Author

PTAL @jerqi

}
}

@Override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't implement the method detectStorage and pickStorage, we can remove them in the abstract class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @smallzhongfeng , wait for CI

@jerqi jerqi merged commit c89f95c into apache:master Oct 9, 2022
@jerqi
Copy link
Contributor

jerqi commented Oct 9, 2022

@smallzhongfeng
Copy link
Contributor Author

https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? test-reports-spark3.2.0 (1).zip

OK, I will fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants