Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-31510][yarn] Use getMemorySize instead of getMemory. #22207

Closed
wants to merge 1 commit into from

Conversation

slfan1989
Copy link
Contributor

@slfan1989 slfan1989 commented Mar 18, 2023

What is the purpose of the change

JIRA: FLINK-31510. Use getMemorySize instead of getMemory

InYARN-4844, such a situation is described. We use int to represent memory (Unit MB). If a machine has 210GB of memory, the conversion to MB is 215040MB. If we have 10K machines, then 215040*10000=2150400000 > 2147483647 , accumulating memory will return a negative number.

In YARN-4844, use getMemorySize instead of getMemory, use Long type to represent memory instead of using Int type to represent memory.

Resource#getMemory

/**
   * This method is DEPRECATED:
   * Use {@link Resource#getMemorySize()} instead
   *
   * Get <em>memory</em> of the resource. Note - while memory has
   * never had a unit specified, all YARN configurations have specified memory
   * in MB. The assumption has been that the daemons and applications are always
   * using the same units. With the introduction of the ResourceInformation
   * class we have support for units - so this function will continue to return
   * memory but in the units of MB
   *
   * @return <em>memory</em>(in MB) of the resource
   */
  @Public
  @Deprecated
  public abstract int getMemory();

Resource#getMemorySize

/**
   * Get <em>memory</em> of the resource. Note - while memory has
   * never had a unit specified, all YARN configurations have specified memory
   * in MB. The assumption has been that the daemons and applications are always
   * using the same units. With the introduction of the ResourceInformation
   * class we have support for units - so this function will continue to return
   * memory but in the units of MB
   *
   * @return <em>memory</em> of the resource
   */
  @Public
  @Stable
  public long getMemorySize();

There is such a code in Flink to accumulate the memory returned by NM. This code may return a negative number.

YarnClusterDescriptor#getCurrentFreeClusterResources

int totalMemory = 0;
int totalCores = 0;
for (NodeReport rep : nodes) {
    final Resource res = rep.getCapability();
    totalMemory += res.getMemory();
    totalCores += res.getVirtualCores();
    ps.format(format, "NodeID", rep.getNodeId());
    ps.format(format, "Memory", getDisplayMemory(res.getMemory()));
    ps.format(format, "vCores", res.getVirtualCores());
    ps.format(format, "HealthReport", rep.getHealthReport());
    ps.format(format, "Containers", rep.getNumContainers());
    ps.println("+---------------------------------------+");
}

Brief change log

Use getMemorySize instead of getMemory.

Verifying this change

Manually verified.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 18, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Member

@chucheng92 chucheng92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.(non-binding)

@@ -670,22 +670,22 @@ private ClusterSpecification validateClusterResources(

final String note =
"Please check the 'yarn.scheduler.maximum-allocation-mb' and the 'yarn.nodemanager.resource.memory-mb' configuration values\n";
if (jobManagerMemoryMb > maximumResourceCapability.getMemory()) {
if (jobManagerMemoryMb > maximumResourceCapability.getMemorySize()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, i think here memory unit is MB. so we can use int, why must long ?

@reswqa reswqa changed the title FLINK-31510. Use getMemorySize instead of getMemory. [FLINK-31510][yarn] Use getMemorySize instead of getMemory. Mar 22, 2023
@reswqa
Copy link
Member

reswqa commented Mar 23, 2023

Thanks @slfan1989 for creating this.

How did you manually verify this issue, could you describe it in detail?

By the way, would you mind resolving conflict in YarnClusterDescriptor?

@slfan1989
Copy link
Contributor Author

Thanks @slfan1989 for creating this.

How did you manually verify this issue, could you describe it in detail?

By the way, would you mind resolving conflict in YarnClusterDescriptor?

@reswqa Thank you very much for your suggestion, I will modify the information as soon as possible

@slfan1989
Copy link
Contributor Author

@reswqa Can you help review this PR again? Thank you very much! I explained the reason for possible negative numbers.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tanks @slfan1989 for the update, I only have one question.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @slfan1989 for the update, would you mind squashing all to only one commit with the message like:

[FLINK-31510][yarn] Replace deprecated getMemory by getMemorySize

@slfan1989
Copy link
Contributor Author

Thanks @slfan1989 for the update, would you mind squashing all to only one commit with the message like:

@reswqa Thank you for your help to review the code! I will rebase and submit.

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @slfan1989, LGTM.

@reswqa reswqa closed this in 0c4db2d Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants