
HDDS-8874. Use OM Service ID instead of OM Node ID for container ownership #4922

Merged: 5 commits into apache:master on Oct 26, 2023

Conversation

ivandika3
Contributor

@ivandika3 ivandika3 commented Jun 17, 2023

What changes were proposed in this pull request?

Currently, OM passes OzoneManager#getOmNodeId as the owner argument in OMKeyRequest#allocateBlock. This means that the created container's owner is based on the OM HA leader's Node ID specified in the ozone configuration (i.e. ozone.om.node.id).

In the case where multiple OM services use the same SCM, if the configured Node IDs are not unique across the SCM cluster, a single container might end up containing blocks owned by multiple OMs. This should be fine for now, but if we later want to separate containers based on storage type, this collision could raise tricky situations.

For example, with the following configuration (omitting OM addresses):

<property>
  <name>ozone.om.service.ids</name>
  <value>service1,service2</value>
</property>
<property>
  <name>ozone.om.nodes.service1</name>
  <value>om1,om2,om3</value>
</property> 
<property>
  <name>ozone.om.nodes.service2</name>
  <value>om1,om2,om3</value>
</property>

If the OM leaders of both service1 and service2 have the same node ID (e.g. om1), both OMs will use that node ID as the owner when allocating blocks, and SCM will assume the requests come from the same owner and allocate the same container for the blocks.
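For a rough illustration of the collision from SCM's point of view, here is a simplified sketch (not the real ContainerManager code; getMatchingContainer below is only a stand-in for the owner-keyed lookup):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: SCM effectively buckets open containers by the owner string,
// so two OM services whose leaders both report "om1" share the same bucket.
public class OwnerCollisionSketch {
  private static final Map<String, List<Long>> openContainersByOwner = new HashMap<>();
  private static long nextContainerId = 1;

  static long getMatchingContainer(String owner) {
    List<Long> open = openContainersByOwner.computeIfAbsent(owner, o -> new ArrayList<>());
    if (open.isEmpty()) {
      open.add(nextContainerId++);  // pretend to allocate a new container
    }
    return open.get(0);             // reuse the open container for this owner
  }

  public static void main(String[] args) {
    long fromService1 = getMatchingContainer("om1"); // leader of service1
    long fromService2 = getMatchingContainer("om1"); // leader of service2, same node ID
    System.out.println(fromService1 == fromService2); // true: blocks of both services mix
  }
}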

We should find a way to pick a unique ID to pass to the allocateBlock call. Currently, a valid option is to use omUuid, since the chance of collisions would be smaller. This patch uses omUuid as the owner of the SCM container.

However, a better option for the container owner is to pick a unique ID for a single OM HA service (instead of node ID). This has possible additional advantages:

  • When the OM leader changes, SCM does not need to create new containers for the new blocks.
  • Fewer OPEN containers, and fewer containers in SCM overall.

Unfortunately, I have not found such an ID available. The candidates I have considered:

  • OM's ClusterID is tied to the SCM so it doesn't make sense to use it as container owner.
  • OM Service ID cannot be used since there doesn't seem to be any uniqueness guarantee across OM HA services. (Edit: we decided to use OM Service ID for now)
  • OM Service Ratis Group ID is also derived from the MD5 hash of OM HA Service ID so it cannot be used.

Another consideration is migrating the existing containers' owners to the new naming convention. This might be done with some container task; it can be handled in another ticket.

Any further suggestions are greatly appreciated.

Edit: We decided to use the OM Service ID for container ownership, since Service IDs are supposed to be unique, albeit not fully enforced.
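For illustration only, a minimal before/after sketch of the direction of the change (the record type and accessor names below are placeholders, not the real OzoneManager API or the exact patch):

// Minimal sketch of the ownership change; OmIds and its accessors are illustrative
// placeholders, not the real OzoneManager API.
public class ContainerOwnerSketch {

  record OmIds(String omNodeId, String omServiceId) { }

  /** Owner string handed to SCM before this change: the per-node ID. */
  static String ownerBefore(OmIds om) {
    return om.omNodeId();     // e.g. "om1" -- may collide across OM services
  }

  /** Owner string handed to SCM after this change: the HA service ID. */
  static String ownerAfter(OmIds om) {
    return om.omServiceId();  // e.g. "service1" -- shared by all OMs in the HA group
  }

  public static void main(String[] args) {
    OmIds om = new OmIds("om1", "service1");
    System.out.println("before: " + ownerBefore(om)); // before: om1
    System.out.println("after:  " + ownerAfter(om));  // after:  service1
  }
}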

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8874

How was this patch tested?

Existing unit tests.

@ivandika3
Contributor Author

ivandika3 commented Jun 17, 2023

Hi @fapifta @ChenSammi, could you help take a look and let me know what you think? Thank you.

@ivandika3 ivandika3 marked this pull request as ready for review June 19, 2023 08:22
@sodonnel
Contributor

This patch uses omUuid as the owner of the SCM container.

However, a better option for the container owner is to pick a unique ID for a single OM HA service (instead of node ID).

I think the omID needs to be the same for all om instances in a HA group. There should not be a need to close all containers on SCM just because there is an OM leader switch.

I am also not sure about the need to have different containers for different OM HA groups sharing a single SCM. OM just deals with blocks, SCM groups blocks into containers. I am not sure if we need to be sure that the blocks from OM group 1 do not mix with blocks from OM group 2 in the same containers.

We don't have storage policies in Ozone at the moment, and even when we do, there is going to be a mix of "hot" and "cold" data in the same containers. Having blocks from two different OM HA groups in the same container doesn't sound much different to that.

@guohao-rosicky
Contributor

Hi @ivandika3, thanks for working on this. For example, for a 3-node OM HA setup, does the omId change when the current OM leader changes?

@ivandika3
Contributor Author

@sodonnel Thank you for your input.

I think the omID needs to be the same for all om instances in a HA group. There should not be a need to close all containers on SCM just because there is an OM leader switch.

I agree. Maybe instead of coming up with a new unique ID, I can find a way to synchronize the omID and use it as the OM HA service identifier. Please let me know what you think.

I am also not sure about the need to have different containers for different OM HA groups sharing a single SCM.

I also have doubts about the necessity of this. Since I'm not really well-versed in the SCM side yet, I decided to take the conservative approach and try to maximize resource isolation (i.e. distinct containers for OM HA groups). I did some manual tests in my cluster and there was no apparent issue. Anyway, I think a single unique ID for an OM HA group is still a worthwhile endeavor.

@ivandika3
Contributor Author

ivandika3 commented Jun 19, 2023

@guohao-rosicky Thank you for raising this.

for a 3-node OM HA setup, does the omId change when the current OM leader changes?

Do you mean the omId passed to OMKeyRequest#allocateBlock? If so, yes, from my understanding of the code it will change when there is a new OM leader. AFAIK, only the OM leader is able to communicate with SCM. Currently, the OM node ID is used as the container owner only during writes (e.g. ContainerManager#getMatchingContainer used in WritableContainerProvider). Please let me know if you have any more input.

@sodonnel
Contributor

I did some manual tests in my cluster and there was no apparent issue

As things stand, EC containers will not use the owner passed in when selecting a pipeline. @guohao-rosicky has a PR up to address it, but I feel the solution is more complex, as EC pipelines and containers are 1 to 1, and essentially the same thing. With EC it would mean maintaining a complete set of new pipelines per owner and per EC scheme.

@ivandika3
Contributor Author

ivandika3 commented Jun 19, 2023

@sodonnel I see; our cluster does not use EC yet, so I'm not really familiar with it. So you mean that one EC pipeline maps to only one EC container? Does that mean the owner should be at the pipeline level instead of the container level?

@sodonnel
Contributor

So you mean that one EC pipeline maps to only one EC container? Does that mean the owner should be at the pipeline level instead of the container level?

Yes, a pipeline is created and a container assigned immediately. When that container is full, the pipeline and container are both closed. EC pipeline creation is cheap, as it's really just picking a set of nodes to use.

There are also limits on the number of EC pipelines. An EC key for 3-2 cannot go into the same container as a 6-3 key, so there is a group of pipelines per EC scheme. If we add the owner, there will need to be a group of pipelines per owner and per EC scheme, so each additional owner would multiply the EC pipelines and, potentially, the open containers in the system.
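As a rough, made-up illustration of that multiplication (neither number below is a real default):

// Illustrative arithmetic only: EC pipeline groups scale with owners x EC schemes.
public class EcPipelineGrowthSketch {
  public static void main(String[] args) {
    int ecSchemes = 2;           // e.g. rs-3-2 and rs-6-3
    int pipelinesPerGroup = 10;  // hypothetical per-group pipeline limit
    for (int owners = 1; owners <= 3; owners++) {
      System.out.printf("owners=%d -> ~%d EC pipelines (and open containers)%n",
          owners, owners * ecSchemes * pipelinesPerGroup);
    }
  }
}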

That is why we should question if the separation is really needed before jumping into doing it.

@ivandika3
Contributor Author

ivandika3 commented Jun 20, 2023

If we add the owner, there will need to be a group of pipelines per owner and per EC scheme, so each additional owner would multiply the EC pipelines and, potentially, the open containers in the system.

@sodonnel Thank you for the info regarding EC containers and pipelines. Yes, I don't think moving the owner to pipeline level should be done for now.

I think the omID needs to be the same for all om instances in a HA group. There should not be a need to close all containers on SCM just because there is an OM leader switch.

Coming back to this, currently omID (UUID) is used to uniquely differentiate between the OMs as certificate clients. Therefore, I don't think this should be set to the same value across all the OMs. This might require us to have a new UUID representing the OM HA Service.

The way I see it, there are two possible issues we are trying to resolve with regard to SCM container owners:

  1. Prevent sharing containers between OM HA services (e.g. prevent lock contention in the shared container)
  2. Reduce the number of containers by reducing the number of possible distinct container owners

I believe changing the owner to omID (UUID) will resolve the first issue. If we only want to resolve the second, the easiest way is to use the OM Service ID as the container owner. If we want to resolve both, I think it requires a new unique UUID to represent the OM HA Service, which calls for a more involved solution (e.g. how to synchronize the UUID between OMs, etc.).

We can limit the scope for this patch to achieve the first goal. @sodonnel @guohao-rosicky Please let me know what you think.

@ivandika3
Contributor Author

ivandika3 commented Jun 27, 2023

From the code comment regarding Container Owners in ContainerStateMap:

 * 2. Owners - Each instance of Name service, for example, Namenode of HDFS or
 * Ozone Manager (OM) of Ozone or CBlockServer --  is an owner. It is
 * possible to have many OMs for a Ozone cluster and only one SCM. But SCM
 * keeps the data from each OM in separate bucket, never mixing them. To
 * write data, often we have to find all open containers for a specific owner.

It seems the intent is to have one owner per name service, which is comparable to the OM Service ID rather than the OM Node ID, so we can change it to the OM Service ID instead. In our cluster, we ultimately decided to use the OM Service ID instead of the OM ID (UUID) for a few reasons (only with respect to Ratis containers):

  • For each owner, SCM initially creates a minimum number of containers per pipeline when allocateBlock is called (i.e. ContainerManagerImpl#getOpenContainerCountPerPipeline). If we use the OM ID, every OM leader change will create these initial containers again, which increases the number of containers and might affect SCM performance. If we use the Service ID instead, the initial containers are created only once (see the sketch after this list).

  • Using the OM ID (UUID) makes the container owners very confusing, since it is hard to tell from a UUID which OM cluster a container belongs to. With the Service ID it is easy to tell which service a container comes from (e.g. service1, service2, etc.).

  • As long as we ensure that the Service IDs of OM services sharing the same SCM are unique, it should be safe to share the SCM. Duplicate Service IDs are unlikely, since Service IDs are supposed to be unique for Ozone client configuration.
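As a rough, made-up illustration of the first point above (the pipeline count and per-pipeline minimum below are not real defaults):

// Illustrative only: how many "initial" containers get created, assuming SCM
// pre-creates a fixed number of open containers per pipeline for each distinct owner.
public class InitialContainerCountSketch {
  public static void main(String[] args) {
    int pipelines = 5;                 // hypothetical Ratis pipeline count
    int openContainersPerPipeline = 3; // hypothetical per-pipeline minimum
    int leaderChanges = 4;             // OM leader elections over some period

    // Owner = OM ID (UUID): every new leader is a new owner, so the initial set is re-created.
    int withOmUuidOwner = (1 + leaderChanges) * pipelines * openContainersPerPipeline;
    // Owner = OM Service ID: one owner per HA group, so the initial set is created once.
    int withServiceIdOwner = pipelines * openContainersPerPipeline;

    System.out.println("OM UUID owner:    ~" + withOmUuidOwner + " containers");    // ~75
    System.out.println("Service ID owner: ~" + withServiceIdOwner + " containers"); // ~15
  }
}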

@ivandika3 ivandika3 changed the title Use OM Storage ID instead of OM Node ID for container ownership Use OM Service ID instead of OM Node ID for container ownership Aug 8, 2023
@ivandika3 ivandika3 changed the title Use OM Service ID instead of OM Node ID for container ownership HDDS-8874. Use OM Service ID instead of OM Node ID for container ownership Aug 31, 2023
@adoroszlai
Contributor

@ivandika3 @sodonnel what is the next step for this PR? Do we need more reviews, a design doc, test coverage, upgrade strategy, or something else?

@sodonnel
Contributor

I am not too sure about the use cases around multiple OMs sharing a single SCM. However, from the discussion here it does sound as though the owner passed from OM to SCM is wrong even for a single HA OM group using SCM, as SCM will see a different ID for each OM in the HA group. What we really want is a single ID shared by all 3 OMs in a HA group.

Therefore we should probably fix that part in this PR.

@ivandika3
Contributor Author

ivandika3 commented Oct 17, 2023

What we really want is a single ID shared by all 3 OMs in a HA group.

@adoroszlai @sodonnel I think OM Service ID should suffice as the container owner for now.

FYI, our cluster has two OM services due to stability issues when sharing a single OM service among different workloads, which affected the stability of one critical use case. Therefore, we spun up a new OM service to isolate that critical use case from other workloads. The container owner is just a possible contention point we identified when analyzing the safety of having two OM services share a single SCM.

Regarding the plan to make EC containers respect the owner, I currently do not have enough EC knowledge and bandwidth to address it. It can be addressed in a different ticket after this is merged, or we can close this ticket and change the container owner once that solution is finalized. I'll leave it to your discretion @adoroszlai @sodonnel @guohao-rosicky.

@kerneltime
Contributor

The ability for multiple OM replication groups to use the same SCM and datanode set is exciting and is a good step towards sharding OM. OM Service ID should be a good proxy for the replicated OM set. I will go over the PR in the coming days.
Thank you for these changes.

@errose28
Contributor

LGTM. I think using OM service ID as the container owner makes sense. There's been a fair amount of discussion here, so I'll leave it open if others have further comments.

Just one minor comment, in OMKeyRequest#allocateBlock, we can change the omID parameter name to serviceID to reflect the new usage.

@ivandika3
Contributor Author

Thank you for the review @errose28.

Just one minor comment, in OMKeyRequest#allocateBlock, we can change the omID parameter name to serviceID to reflect the new usage.

Updated.

@adoroszlai
Contributor

Thanks @ivandika3 for updating the patch. There is a checkstyle error:

hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/request/key/TestOMKeyRequest.java
 170: Line is longer than 80 characters (found 81).

https://github.com/ivandika3/ozone/actions/runs/6597636911/job/17924816077#step:7:11

@ivandika3
Contributor Author

@adoroszlai Apologies, I have fixed the checkstyle failure. Thanks for checking it.

@adoroszlai adoroszlai merged commit 04f13ac into apache:master Oct 26, 2023
31 checks passed
@adoroszlai
Contributor

Thanks @ivandika3 for the patch, @errose28, @guohao-rosicky, @kerneltime, @sodonnel for the review.

ibrusentsev pushed a commit to ibrusentsev/ozone that referenced this pull request Nov 14, 2023
@ivandika3 ivandika3 self-assigned this Apr 23, 2024