Skip to content

[CELEBORN-1597][CIP-11] Implement TagsManager#2739

Closed
s0nskar wants to merge 7 commits intoapache:mainfrom
s0nskar:tags_manager
Closed

[CELEBORN-1597][CIP-11] Implement TagsManager#2739
s0nskar wants to merge 7 commits intoapache:mainfrom
s0nskar:tags_manager

Conversation

@s0nskar
Copy link
Contributor

@s0nskar s0nskar commented Sep 13, 2024

What changes were proposed in this pull request?

Added tags manager which will be responsible for managing worker tags. This will be used in the follow up PRs

Why are the changes needed?

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

Does this PR introduce any user-facing change?

NA

How was this patch tested?

Added UTs

type Tag = String
type WorkerId = String

type TagsStore = ConcurrentHashMap[Tag, Set[WorkerId]]
Copy link
Contributor

@mridulm mridulm Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the value of TagStore is not exposed outside of TagsManager for both read/write, make this a mutable MT-safe data structure ? like ConcurrentHashMap.newKeySet() for some such for ex ?

It will avoid perf degradation if mutations keep happening.

Suggested change
type TagsStore = ConcurrentHashMap[Tag, Set[WorkerId]]
import java.util.{Set => JSet}
type TagsStore = ConcurrentHashMap[Tag, JSet[WorkerId]]

Copy link
Contributor Author

@s0nskar s0nskar Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will avoid perf degradation if mutations keep happening.

@mridulm I've updated the code to use JSet, although would like to read more about this, is there any references for this.

}

def getTagsForWorker(worker: WorkerInfo): Set[Tag] = {
tagStore.asScala.filter(_._2.contains(worker.host)).keySet.toSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be relatively expensive - are we expecting it to be a common operation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this won't be a common operation and will only used if someone calls list tags for worker API. Operation cost would be upper bounded by O(tags) * O(log (numWorkers))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, if this is not a common operation, we can keep this code.

However, if this API is frequently used, O^2 + large number of celeborn workers may cause performance degradation. We can use additional data structures to assist in querying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, we can build a reverse index mapping of worker -> tags if needed in future.

}

def removeTagFromWorker(tag: Tag, worker: WorkerInfo): Unit = {
val workerId = worker.host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker's host cannot be used as the unique identifier of a worker. It is more appropriate to use the WorkerInfo#toUniqueId method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can just put the WorkerInfo, it's comparable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zwangsheng Thats a good point but WorkerInfo#toUniqueId contains too much information. IMO for the external interface like File / DB / API passing the worker unique id can become cumbersome to manage for admins while host names are easy to manage.

Another option to uniquely identify a worker can be workerHost:rpcPort which will also be unique.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s0nskar Sorry for the late reply, using just host may cause problems in some users' case.

Use workerHost:rpcPort as the unique key seems reasonable to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the PR to use toUniqueId as it is predefined.

Filed a follow up task to decide if we should use workerHost:rpcPort or something better.

s0nskar and others added 4 commits September 16, 2024 17:19
Co-authored-by: Mridul Muralidharan <1591700+mridulm@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Sep 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.13%. Comparing base (59a3995) to head (119adc3).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2739      +/-   ##
==========================================
+ Coverage   33.11%   33.13%   +0.02%     
==========================================
  Files         314      314              
  Lines       18373    18435      +62     
  Branches     1683     1691       +8     
==========================================
+ Hits         6082     6106      +24     
- Misses      11951    11988      +37     
- Partials      340      341       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

import org.apache.celeborn.service.deploy.master.tags.TagsManager.{Tag, TagsStore, WorkerId}

object TagsManager {
type Tag = String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these three types alias are not needed. These types are not complex. This won't help others to understand this code.

Copy link
Contributor Author

@s0nskar s0nskar Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, will remove these.

type Tag = String
type WorkerId = String

type TagsStore = ConcurrentHashMap[Tag, JSet[WorkerId]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value type of the "JSet" should be WorkerInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a slight disconnect. TagsManager will store the mapping for tags to Set{workerId}. It is possible the tagged worker might not be present/running currently.

TagsManger responsibility is just to filter out the tagged worker from given a list of WorkerInfo.

val workerId = worker.host
val workers = tagStore.computeIfAbsent(tag, addNewTagFunc)
logInfo(s"Adding Tag $tag to worker $workerId")
workers.add(workerId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have already got the worker, just storing the worker will be better. The worker info had override equals methods, it can be put into a set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FMX This will be manager by external interfaces as well where user will have to add a unique identifier for a worker. We are comparing worker with the identifier. wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at your method declaration, you have already got the workerInfo you need. In this tag manager, storing a string as a worker's identifier is unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you want to convey that we need to find the right worker information. However, finding the correct work info can be accomplished outside of this tag manager. For example, we can use a worker's host or a worker's unique ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i implemented wrong definition, it should have been

def addTagToWorker(tag: Tag, worker: WorkerId)

I will update the same for this and other functions.

@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 26, 2024

@mridulm @FMX @zwangsheng updated the PR, PTAL.

Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Although this implementation may have performance issues, it's OK. We can optimize performance if we encounter.

@FMX
Copy link
Contributor

FMX commented Sep 27, 2024

Merged into main(v0.6.0).

@FMX FMX closed this in 3fc74ec Sep 27, 2024
@s0nskar
Copy link
Contributor Author

s0nskar commented Sep 27, 2024

Although this implementation may have performance issues

@FMX can you mention the issues, i will try to address those in follow up PR.

@s0nskar s0nskar deleted the tags_manager branch September 27, 2024 07:59
HolyLow pushed a commit to HolyLow/celeborn that referenced this pull request Oct 14, 2024
### What changes were proposed in this pull request?

Added tags manager which will be responsible for managing worker tags. This will be used in the follow up PRs

### Why are the changes needed?

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?
Added UTs

Closes apache#2739 from s0nskar/tags_manager.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants