Skip to content

[HUDI-6220] Add HUDI code version to commit files and hoodie.properties.#8724

Open
prashantwason wants to merge 6 commits intoapache:masterfrom
prashantwason:pw_hudi_version
Open

[HUDI-6220] Add HUDI code version to commit files and hoodie.properties.#8724
prashantwason wants to merge 6 commits intoapache:masterfrom
prashantwason:pw_hudi_version

Conversation

@prashantwason
Copy link
Member

[HUDI-6220] Add HUDI code version to commit files and hoodie.properties.

Change Logs

  1. Added a class HUDIVersion which can be used to get HUDI version
  2. Added HUDI version and other engine specific information to HoodieCOmmitMetadata as extraMetadata
  3. Added HUDI version to the file comment for hoodie.properties

Impact

More debugging information available

Risk level (write none, low medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @bvaradar Can you help the review?

@bvaradar
Copy link
Contributor

will review this tomorrow.

@bvaradar bvaradar self-assigned this May 17, 2023
Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prashantwason : Overall, looks good. Few questions.

} else {
LOG.warn("Unable to find driver bind address from spark config");
this.hostAddr = NetworkUtils.getHostname();
this.hostAddr = NetworkUtils.getHostAddr();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rational for this change ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NetworkUtils.getHostname function is returning the IP address. So renamed it to getHostAddr which is more appropriate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I added the getHostName function which returns the hostname

@Override
public Map<String, String> getInfo() {
final Map<String, String> info = new HashMap<>();
System.getProperties().stringPropertyNames().forEach(property -> info.put(property, System.getProperty(property)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of blindly copying all properties, we should do this selectively, as there may be sensitive information here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. we should be judicious here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These files are within the dataset itself. The dataset would contain sensitive information too. So I dont see the issue.

Anyways, I will filter them.

return parts;
}

public static void main(String[] args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@nsivabalan nsivabalan self-assigned this May 23, 2023
@Override
public Map<String, String> getInfo() {
final Map<String, String> info = new HashMap<>();
System.getProperties().stringPropertyNames().forEach(property -> info.put(property, System.getProperty(property)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. we should be judicious here

info.put("spark.defaultParallelism", String.valueOf(javaSparkContext.defaultParallelism()));
info.put("spark.defaultMinPartitions", String.valueOf(javaSparkContext.defaultMinPartitions()));
info.put("spark.executor.instances", String.valueOf(javaSparkContext.getConf().get("spark.executor.instances")));
return info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also let users configure any hoodie write configs to be added to the extra metadata. since we don't serialize them anywhere, sometimes might come in handy during investigations. by default we don't need to add any hoodie write configs. but if user configures them, we can add them as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply easier to add the entire hoodie config (without schema).

if (isValidChecksum(props)) {
final boolean isValidChecksum = isValidChecksum(props);
final String comment = String.format("Date=%s, host=%s, #properties=%d, hudi_version=%s",
Instant.now(), NetworkUtils.getHostname(), isValidChecksum ? props.size() : props.size() + 1, HoodieVersion.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you modified NetworkUtils.getHostname() just for this purpose, lets keep the existing method and add a new one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NetworkUtils.getHostname() does not return hostname - it returns IP address. So this is a fix.

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yihua
Copy link
Contributor

yihua commented May 30, 2023

Hi @prashantwason could you check the build failure?

@prashantwason
Copy link
Member Author

@hudi-bot run azure

@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Jun 5, 2023
@codope codope force-pushed the pw_hudi_version branch from dd2b117 to 7b99dac Compare June 19, 2023 08:35
@prashantwason
Copy link
Member Author

@hudi-bot run azure

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs and removed priority:critical Production degraded; pipelines stalled labels Jun 21, 2023
@prashantwason
Copy link
Member Author

@hudi-bot run azure

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar
Copy link
Contributor

bvaradar commented Oct 4, 2023

@prashantwason : Can you fix the conflictx and test so that we can land this ?

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs size:L PR with lines of changes in (300, 1000]

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

7 participants