-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy #210
Conversation
Codecov Report
@@ Coverage Diff @@
## master #210 +/- ##
============================================
- Coverage 59.31% 59.16% -0.16%
+ Complexity 1345 1339 -6
============================================
Files 161 163 +2
Lines 8780 8810 +30
Branches 828 833 +5
============================================
+ Hits 5208 5212 +4
- Misses 3304 3332 +28
+ Partials 268 266 -2
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
|
||
Map<String, RankValue> getRemoteStoragePathRankValue(); | ||
void setFs(FileSystem fs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would better not assume our remote storage is only FileSystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate more, or give me some other examples ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to support S3 , COS, Redis or some DBs, it's not suitable for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can remove this method, thinking that if I want to support redis or other databases, I need to re implement the sortPathByRankValue
method. It's a huge job.
Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok? |
1 similar comment
Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok? |
No problem, I got it. |
6f1cf29
to
3c095b3
Compare
Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo(); | ||
|
||
Map<String, RankValue> getRemoteStoragePathRankValue(); | ||
List<Map.Entry<String, RankValue>> sortPathByRankValue(String path, Path testPath, long time, boolean isHealthy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Path seems to be a concept of filesystem. It may be not general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can support fileSystem at the beginning. If we need to support db, then there may be more things to consider, or we can add a parameter. If it is a file supported by fileSystem, we will turn on this switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For S3
, COS
, it's OK, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least We need consider the object Storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
�IIRC, object storage such as s3 also has the concept of Path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But cos
doesn't seem to have this concept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I know, the object storage also can be accessed by Hadoop Filesystem Interface (HCFS), like s3/oss/cos. So this is not a problem
But if we want to involve more storages which are not implemented by HCFS or speed up specified api which is only compatible with original storage api, maybe we should implement uniffle's own filesystem, it can refer to Flink.
Please let me know if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your reminder. Although I am not very impressed with this, I have used a similar method to read files in s3.
For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi |
Do the S3 have the concept of |
Could we support anomaly detection of hdfs files first? |
We should have more general interface. We can only implement part of them. |
Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it. |
We shouldn't relay on the filesystem of object storage too much. Some operations of object storage is slow. For example, |
I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems. |
My doubt point is that the concept of path is not general one for the storage. |
So I did not use the concept of |
The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface. |
It's ok. |
Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo(); | ||
|
||
Map<String, RankValue> getRemoteStoragePathRankValue(); | ||
List<Map.Entry<String, RankValue>> readAndWrite(String path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a little doubt, why do the readAndWrite
become the only interface method for SelectStorageStrategy? I can rank by many methods, it's not necessary to read and write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of the readAndWrite
interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue
, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of the readAndWrite
interface is to perform anomaly detection of remote paths. For hdfs, reading and writing files is a more direct way to detect, so I took this name, but the real comparison method comes from sortPathByRankValue
, which I do not have. The reason for defining it as an interface is because the objects to be sorted may be different, and this is left to different storage methods for implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readAndWrite
is not a good name. It's strange for the interface SelectStorageStrategy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a method pickStorage
or selectStorage
for this interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that might make more sense for this interface, I will add.
PTAL @jerqi |
import org.apache.uniffle.coordinator.LowestIOSampleCostSelectStorageStrategy.RankValue; | ||
|
||
public interface SelectStorageStrategy { | ||
|
||
RemoteStorageInfo pickRemoteStorage(String appId); | ||
List<Map.Entry<String, RankValue>> detectStorage(String path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path -> uri
?
Path is not general concept for storage. Maybe we should use uri.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, good suggestion.
Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo(); | ||
|
||
Map<String, RankValue> getRemoteStoragePathRankValue(); | ||
List<Map.Entry<String, RankValue>> pickStorage(List<Map.Entry<String, RankValue>> pathList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we give pathList
a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any good advice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about uriRankings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uriList
or uris
may be better?
Why do the method pickStorage
return a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified it to let pickStorage
return to a selected path.
Map<String, RemoteStorageInfo> getAppIdToRemoteStorageInfo(); | ||
|
||
Map<String, RankValue> getRemoteStoragePathRankValue(); | ||
String pickStorage(List<Map.Entry<String, RankValue>> uris, String appId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return RemoteStorageInfo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need, because the return value of pickStorage
only needs to be passed to incRemoteStorageCounter
, incRemoteStorageCounter
only needs String
type, and String
type is more general than RemoteStorageInfo
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need, because the return value of
pickStorage
only needs to be passed toincRemoteStorageCounter
,incRemoteStorageCounter
only needsString
type, andString
type is more general thanRemoteStorageInfo
.
String
is too general to describe what we want to return. RemoteStorageInfo
is more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, updated.
Path testPath = new Path(rssTest); | ||
try { | ||
FileSystem fs = HadoopFilesystemProvider.getFilesystem(remotePath, hdfsConf); | ||
for (int j = 0; j < readAndWriteTimes; j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just detect the health of storage, it seems unnecessary for us write and read them multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For appBalance, it can indeed be changed once, but for io sampling, the number of times can be increased, perhaps we can set the parameter to 1 by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For appBalance, it shouldn't be a configuration option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
private final Configuration hdfsConf; | ||
private final int fileSize; | ||
private final int readAndWriteTimes; | ||
private boolean remotePathIsHealthy = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this variable remotePathIsHealthy
mean?
This class is a strategy. I can't get the meaning of this varibale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is used to determine whether the path is normal, and to pass the parameter into the method sortPathByRankValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is used to determine whether the path is normal, and to pass the parameter into the method
sortPathByRankValue
.
One strategy should have multiple paths, every path is healthy, the variable is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that there are many paths, but this attribute is applied to each path. The healthy path is processed by sortPathByRankValue
according to remotePathIsHealthy
as true, and corresponding logic is processed when it is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can look at the logic of sortPathByRankValue
. When we judge by this parameter, it is a normal path, and then deal with it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the variable necessary to become member field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
uris = uris.stream().filter(rv -> rv.getValue().getReadAndWriteTime().get() != Long.MAX_VALUE).sorted( | ||
Comparator.comparingInt(entry -> entry.getValue().getAppNum().get())).collect(Collectors.toList()); | ||
} else { | ||
// If all paths are unhealthy, assign paths according to the number of apps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add some logs and metrics when all paths are unhealthy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will add.
conf.getLong(CoordinatorConf.COORDINATOR_REMOTE_STORAGE_SCHEDULE_TIME), TimeUnit.MILLISECONDS); | ||
} | ||
|
||
public void checkReadAndWrite() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we put this logic into the class SelectStorageStrategy
?
Maybe we should have a abstract class to define the common behaviors of SelectStorageStrategy. |
This modification has done three things
|
This comment was marked as resolved.
This comment was marked as resolved.
PTAL @jerqi |
} | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't implement the method detectStorage
and pickStorage
, we can remove them in the abstract class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @smallzhongfeng , wait for CI
https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? |
OK, I will fix it. |
What changes were proposed in this pull request?
The detection logic of abnormal paths is also added to the
APP_BALANCE
strategy.Why are the changes needed?
Anomaly detection can be performed when selecting a remote path to avoid selecting a problematic path.
Does this PR introduce any user-facing change?
The parameters of the original IO_SAMPLE strategy are reused and the names are changed.
rss.coordinator.remote.storage.select.strategy
supportAPP_BALANCE
andIO_SAMPLE
,APP_BALANCE
selection strategy based on the number of apps,IO_SAMPLE
selection strategy based on time consumption of reading and writing files.rss.coordinator.remote.storage.schedule.time
, if user chooseIO_SAMPLE
, file will be read and written at regular intervals.rss.coordinator.remote.storage.schedule.file.size
, the size of each read / write HDFS file.rss.coordinator.remote.storage.schedule.access.times
, number of times to read and write HDFS files.How was this patch tested?
Passed uts.