Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support user's app quota level limit #311

Merged
merged 13 commits into from
Nov 22, 2022

Conversation

smallzhongfeng
Copy link
Contributor

@smallzhongfeng smallzhongfeng commented Nov 11, 2022

What changes were proposed in this pull request?

For issue #211 and the design document https://docs.google.com/document/d/1MApSMFQgoS1VAoKbZjomqSRm0iTbSuKG1yvKNlWW65c/edit?usp=sharing

Why are the changes needed?

Better isolation of resources between different users.

Does this PR introduce any user-facing change?

Add config rss.coordinator.quota.default.app.num to set default app number each user and rss.coordinator.quota.default.path to set a path to record the number of apps that each user can run.

How was this patch tested?

Add uts.

@jerqi
Copy link
Contributor

jerqi commented Nov 11, 2022

Could you modify the description to attach your design document? Do the shuffle server have quota concept? cc @Gustfh

@@ -302,6 +302,7 @@ service CoordinatorServer {

message AppHeartBeatRequest {
string appId = 1;
string user = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the heartbeat need the user?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we register the shuffle, we put the user information to the shuffle server. Is it enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it isn't enough. Though user information to the shuffle server, we have to send server heartbeat to coordinator.So we use this app heartBeat to refresh each user' s app number which stored in coordinator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For coordinator, could we pass the user information to the coordinator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the heartbeat of the driver is used to send to the coordinator, the implementation is simple. If the heartbeat of the shuffleServer is used, you need to add a collection attribute in the heartbeat request of the shuffleServer, record the user and the corresponding app list, and then summarize them in the coordinator. Right ? And I haven't found the advantages of doing this for the time being. At present, this pr and our current production environment deployment have enough restrictions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get your point. I feel it's unnecessary and strange that we put the user information to the heartbeat. We have register the user information to shuffle server. Why do we need heartbeat? We hope we can reduce the repeated information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got your point. The coordinator needs to rely on the user to update the life cycle of the app, so the user is added to the heartbeat of the app. As for the repetitive problem you mentioned, do you have a better solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the heartbeat of the driver is used to send to the coordinator, the implementation is simple. If the heartbeat of the shuffleServer is used, you need to add a collection attribute in the heartbeat request of the shuffleServer, record the user and the corresponding app list, and then summarize them in the coordinator. Right ? And I haven't found the advantages of doing this for the time being. At present, this pr and our current production environment deployment have enough restrictions.

What's your restrictions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At present,restrictions is the number of apps that each user can submit.

@codecov-commenter
Copy link

codecov-commenter commented Nov 11, 2022

Codecov Report

Merging #311 (25b4b78) into master (79d2f54) will decrease coverage by 2.93%.
The diff coverage is 70.87%.

@@             Coverage Diff              @@
##             master     #311      +/-   ##
============================================
- Coverage     61.32%   58.39%   -2.94%     
- Complexity     1526     1552      +26     
============================================
  Files           186      193       +7     
  Lines          9441    10761    +1320     
  Branches        924      937      +13     
============================================
+ Hits           5790     6284     +494     
- Misses         3341     4103     +762     
- Partials        310      374      +64     
Impacted Files Coverage Δ
...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java 0.00% <0.00%> (ø)
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 28.38% <0.00%> (-1.76%) ⬇️
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.09% <0.00%> (-0.21%) ⬇️
.../org/apache/uniffle/coordinator/AccessManager.java 94.28% <75.00%> (-5.72%) ⬇️
...a/org/apache/uniffle/coordinator/QuotaManager.java 88.63% <88.63%> (ø)
...apache/uniffle/coordinator/ApplicationManager.java 84.93% <91.17%> (+1.13%) ⬆️
...apache/uniffle/coordinator/AccessQuotaChecker.java 95.83% <95.83%> (ø)
.../java/org/apache/uniffle/common/util/RssUtils.java 63.35% <100.00%> (+0.28%) ⬆️
.../apache/uniffle/coordinator/AccessCheckResult.java 92.30% <100.00%> (+8.97%) ⬆️
...ava/org/apache/uniffle/coordinator/AccessInfo.java 92.30% <100.00%> (+1.39%) ⬆️
... and 13 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@smallzhongfeng
Copy link
Contributor Author

Could you modify the description to attach your design document? Do the shuffle server have quota concept? cc @Gustfh

For now, shuffle server don't have quota concept.

@@ -203,12 +206,12 @@ public RssShuffleManager(SparkConf sparkConf, boolean isDriver) {
@Override
public <K, V, C> ShuffleHandle registerShuffle(int shuffleId, int numMaps, ShuffleDependency<K, V, C> dependency) {
// If yarn enable retry ApplicationMaster, appId will be not unique and shuffle data will be incorrect,
// appId + timestamp can avoid such problem,
// appId + uuid can avoid such problem,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need uuid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't get the appId when we try Access, because the appId is generated after the RssManager is created. In order to support push down, we maintain the uuid as a substitute for the appId, and replace the uuid with the appId after the app heartbeat is reported to the coordinator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put the appId to the AcessInfo when we try to access coordinator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to generate the uuid on the driver?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fault. We ignore that I can't get the appId in the construct method. Let me think twice.

@jerqi
Copy link
Contributor

jerqi commented Nov 12, 2022

If we have shuffle server quota, shuffle server will report the some metrics to the coordinator. The metrics will contain user quota info. Maybe we can reuse the information. @Gustfh WDTY?

@jerqi
Copy link
Contributor

jerqi commented Nov 12, 2022

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

@smallzhongfeng
Copy link
Contributor Author

smallzhongfeng commented Nov 12, 2022

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

You mean to add a request different from RssAppHeartbeat, right?

@jerqi
Copy link
Contributor

jerqi commented Nov 13, 2022

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

You mean to add a request different from RssAppHeartbeat, right?

Yes.

@smallzhongfeng
Copy link
Contributor Author

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

PTAL. @jerqi

public void appHeartbeat(
AppHeartBeatRequest request,
StreamObserver<AppHeartBeatResponse> responseObserver) {
public void registerApplicationInfo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we rename appHeartbeat to registerApplicationInfo? I mean that we can add a method registerApplicationInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a rpc method called registerApplicationInfo for Coordinator.

Didn't it say to add a new rpc method? If not, the shuffle server still get appHeartbeat contained the user's attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CoordinatorCrpcService should have heartbeat and registerApplicationInfo methods at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to keep sendAppHeartbeat, and use registerApplicationInfo to send a request to the coordinator only once for recording. And still use sendAppHeartbeat later, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to keep sendAppHeartbeat, and use registerApplicationInfo to send a request to the coordinator only once for recording. And still use sendAppHeartbeat later, right?

Yes. we use the registerApplicationInfo to coordinator and then send heartbeat to coordinator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -282,6 +328,58 @@ private String getStorageHost(String remoteStoragePath) {
return storageHost;
}

public void detectUserResource() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the application manager involve the quota logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we can configure rss.coordinator.quota.default.path to controls whether to execute the logic of detectUserResource. If not configured, the logic will not be executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a quota manager? Or will application manager manage both quota and application?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe quota manager will be better.

@@ -228,7 +254,8 @@ public void accessCluster(AccessClusterRequest request, StreamObserver<AccessClu
new AccessInfo(
request.getAccessId(),
Sets.newHashSet(request.getTagsList()),
request.getExtraPropertiesMap()
request.getExtraPropertiesMap(),
request.getUser()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a compatible feature? What will happen if a old client request a new coordinator service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 36fb720

@@ -39,6 +39,13 @@
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
<exclusions>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this.

@@ -239,6 +239,7 @@ public Thread newThread(Runnable r) {
long heartbeatInterval = conf.getLong(RssMRConfig.RSS_HEARTBEAT_INTERVAL,
RssMRConfig.RSS_HEARTBEAT_INTERVAL_DEFAULT_VALUE);
long heartbeatTimeout = conf.getLong(RssMRConfig.RSS_HEARTBEAT_TIMEOUT, heartbeatInterval / 2);
client.registerApplicationInfo(appId, heartbeatTimeout, "user");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove this if mr don't need to register application info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this because applicationManager remove expired app need user. When our Spark does not use the AppQuota checker, it also needs register application info, so it will not have much impact here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this isn't compatible feature. If we use old client to connect new service, something wrong will happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the old client does not have user information or method of registerApplicationInfo, app can only be updated at refreshApp, which I am compatible with.

@@ -222,6 +225,24 @@ public RssAppHeartBeatResponse sendAppHeartBeat(RssAppHeartBeatRequest request)
return response;
}

@Override
public RssApplicationInfoResponse sendApplicationInfo(RssApplicationInfoRequest request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sendApplicationInfo -> registerApplicationInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, wait for CI

@jerqi jerqi changed the title [Feature] User's app quota [Feature] Support user's app quota limit Nov 21, 2022
@jerqi jerqi changed the title [Feature] Support user's app quota limit [Feature] Support user's app quota level limit Nov 21, 2022
@smallzhongfeng
Copy link
Contributor Author

Because in the discussion of the last meeting, the shuffle server also needs to be limited, but for the number of apps, the number of apps on a sever may not directly determine the load, so I have not developed this place. WDYT? @jerqi

@jerqi
Copy link
Contributor

jerqi commented Nov 21, 2022

Because in the discussion of the last meeting, the shuffle server also needs to be limited, but for the number of apps, the number of apps on a sever may not directly determine the load, so I have not developed this place. WDYT? @jerqi

It's ok for me. Wait for a moment. Let me see @Gustfh whether have another suggestion. If he don't reply us, I'll merge this pr tomorrow.

@jerqi
Copy link
Contributor

jerqi commented Nov 22, 2022

merged. Thanks @smallzhongfeng

@jerqi jerqi merged commit 8ff41a5 into apache:master Nov 22, 2022
@xianjingfeng
Copy link
Member

@jerqi @smallzhongfeng I think this feature should be an option. WDYT?

@jerqi
Copy link
Contributor

jerqi commented Nov 23, 2022

@jerqi @smallzhongfeng I think this feature should be an option. WDYT?

Yes, this is an option now. You can see

public static final ConfigOption<List<String>> COORDINATOR_ACCESS_CHECKERS = ConfigOptions

@xianjingfeng
Copy link
Member

Yes, this is an option now. You can see

But QuotaManager always be created, and it will create a thread pool.

@jerqi
Copy link
Contributor

jerqi commented Nov 23, 2022

Yes, this is an option now. You can see

But QuotaManager always be created, and it will create a thread pool.

You can raise a pr to fix this issue.

@smallzhongfeng
Copy link
Contributor Author

There are some unreasonable places here. I will fix it. @jerqi @xianjingfeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants