[RFC-62] Diagnostic Reporter #6600

Open · wants to merge 4 commits into master
Conversation

@zhangyue19921010 (Contributor) commented Sep 5, 2022

Change Logs

As Hudi develops, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
Some of them subsequently ask the community for help, for example: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail?

When dealing with such issues, volunteers in the Hudi community typically ask users to provide a list of information, including engine context, job configs,
data pattern, Spark UI, etc. Users then need to spend extra effort reviewing their own jobs, collecting metrics one by one according to the list, and reporting back to the volunteers.
Moreover, mistakes can easily occur while users manually collect this information.

Obviously, there are relatively high communication costs for both volunteers and users.

On the other hand, advanced users also need a way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.

In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
The tool can be enabled as the final stage of an ingestion job, running after the commit; it collects common troubleshooting information, including engine runtime information (taking Spark as the example here), and generates a diagnostic report JSON file.

Alternatively, users can trigger the diagnostic reporter via hudi-cli to generate the report JSON file.

Impact

no impact

Risk level: none | low | medium | high


Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@parisni (Contributor) commented Sep 7, 2022

Could we also mention the need to obfuscate enterprise information such as IP addresses, hostnames, buckets, table or column names, and so on?

@codope (Member) left a comment

@zhangyue19921010 Thanks for writing this RFC. A diagnostic reporter would be very useful for the community.
I had a broader scope in mind, essentially thinking in terms of a system instead of tooling. We can run a web app to view these metrics in a more user-friendly manner. We can split the implementation into multiple phases: in the first phase, implement what you suggested and gather more information based on adoption of this feature; in subsequent phases, work on developing the web app. Let me know what you think.
cc @xushiyan @prasannarajaperumal


## Proposers

- zhangyue19921010@163.com
Member:

nit: just keep it as your github user id i.e. zhangyue19921010


JIRA: https://issues.apache.org/jira/browse/HUDI-4707

> Please keep the status updated in `rfc/README.md`.
Member:

can remove this line

## Background
As we know, Hudi already has its own unique metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:

1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
Member:

Suggested change
1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
1. Hudi will record the complete timeline in the `.hoodie` directory, including active timeline and archive timeline. The timeline acts as an **event log** for the Hudi table using which one can track table snapshots.
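To make this concrete, here is a minimal sketch (not part of the RFC) that lists the instants on the active timeline by scanning `.hoodie` with plain `java.nio`; the local base path is an assumption, since a real table may live on HDFS or object storage:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ListActiveTimeline {
  public static void main(String[] args) throws IOException {
    // Assumed local table base path for illustration.
    Path hoodieDir = Paths.get("/tmp/hudi_table/.hoodie");
    try (Stream<Path> files = Files.list(hoodieDir)) {
      files.filter(Files::isRegularFile)
           // Instant files are named <instant-time>.<action>[.<state>],
           // e.g. 20220818134233973.commit
           .filter(p -> p.getFileName().toString().matches("\\d+\\..*"))
           .sorted()
           .forEach(p -> System.out.println(p.getFileName()));
    }
  }
}
```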


1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.

2. Hudi metadata table which will records all the partitions, all the data files, etc
Member:

Suggested change
2. Hudi metadata table which will records all the partitions, all the data files, etc
2. Hudi metadata table which records partitions, data files, columns statistics, etc.


In order to expose hudi table context more efficiently, this RFC propose a Diagnostic Reporter Tool.
Member:

Suggested change
In order to expose hudi table context more efficiently, this RFC propose a Diagnostic Reporter Tool.
In order to expose hudi table context more efficiently, this RFC proposes a Diagnostic Reporter System.

Comment on lines 172 to 174
We can quickly grasp the data distribution characteristics of the current hudi table through this part, which can be used to determine whether there is a small-file problem
or a data-hotspot problem, etc.
In addition, through data distribution and sample hudi keys, we can also use this information to help users choose the most appropriate index mode, for example:
Member:

+1

The second part is the `metadata information` related to the last k active commits, sorted by time as a List

Member:

I am now thinking that if we are going to zip .hoodie anyway, adding these commit metadata may be redundant. It could still save some time, but may not add much value.
Instead, we could add some table-level aggregates of what actions have been done, in a tabular view. For example:

| Action | Instant | Num Inserts | Num Updates | Total Errors | More Stats |
| --- | --- | --- | --- | --- | --- |
| deltacommit | 20220731224018987 | 100 | 100 | 1 | |
| replacecommit | 20220731223129863 | 500 | 100 | 0 | |

Contributor Author:

Nice catch!
We can retain the ability to grab the entire .hoodie directory, but have it disabled by default. The zip file provides the most detailed information, which is sometimes helpful for tough problems.

> we could add some information about table level aggregate

Maybe we can add this info in the Meta information part; for now we could collect:

```json
{
  "configs":{
    "engine_config":{
      "spark.executor.memoryOverhead":"3072"
    },
    "hoodie_config":{
      "hoodie.datasource.write.keygenerator.class":"org.apache.hudi.keygen.ComplexKeyGenerator"
    }
  },
  "commits":[
    {
      "20220818134233973.commit":{
        "totalNumWrites":123352,
        "totalNumDeletes":0,
        "totalNumUpdateWrites":0,
        "totalNumInserts":123352,
        "totalWriteBytes":4675371,
        "totalWriteErrors":0,
        "totalLogRecords":0,
        "totalLogFilesCompacted":0,
        "totalLogSizeCompacted":0,
        "totalUpdatedRecordsCompacted":0,
        "totalScanTime":0,
        "totalCreateTime":21051,
        "totalUpsertTime":0
      }
    }
  ]
}
```

2. Hudi metadata table which will records all the partitions, all the data files, etc

3. Each commit of hudi records various metadata information and runtime metrics currently written, such as:
Member:

Can remove the json blob from background. Just mention what kind of stats commit metadata already has, and maybe also point to the class.



## Rollout/Adoption Plan
Member:

Please fill in these details, especially any new configs and any effect on performance.


## Implementation

This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
Member:

Should we also allow users to configure application logs dir and zip the logs?

Contributor Author:

Emmm, of course we can collect driver logs and all executor logs. The only worry is that the log volume may be too large :<

@YuweiXiao (Contributor) left a comment

Left some comments, looking forward to this feature!


## Implementation

This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
@YuweiXiao (Contributor) commented Sep 8, 2022

Maybe provide an option to compress only the active-timeline-related files? The table may have been running for a long time and the whole .hoodie directory may be huge.

Contributor Author:

Nice catch. Added in this RFC.
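For illustration, a minimal sketch of that packaging step with `java.util.zip`, assuming a local `.hoodie` path and the standard `archived/` sub-directory for the archived timeline (a real implementation would go through Hudi's filesystem abstraction):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipHoodieDir {

  /** Zips the .hoodie directory; when activeOnly is true, the archived timeline is skipped. */
  static void zipHoodie(Path hoodieDir, Path zipFile, boolean activeOnly) throws IOException {
    try (Stream<Path> walk = Files.walk(hoodieDir);
         OutputStream os = Files.newOutputStream(zipFile);
         ZipOutputStream zos = new ZipOutputStream(os)) {
      List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
      for (Path p : files) {
        Path rel = hoodieDir.relativize(p);
        // The archived timeline lives under .hoodie/archived/.
        if (activeOnly && rel.startsWith("archived")) {
          continue;
        }
        zos.putNextEntry(new ZipEntry(rel.toString()));
        Files.copy(p, zos);
        zos.closeEntry();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    zipHoodie(Path.of("/tmp/hudi_table/.hoodie"), Path.of("/tmp/hoodie_diagnostics.zip"), true);
  }
}
```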

"data_volume":"1000GB",
"partition_keys":"year/month/day"
},
"max_data_volume_per_partition":{
Contributor:

Will we record volume info for all partitions, or just the min/max partitions?

Contributor Author:

Just the min/max partitions are probably good enough.
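For illustration, a sketch of how min/max partition volumes could be computed from a plain file listing; it assumes a local base path and single-level partitions (a `year/month/day` layout would need recursion, or the metadata table):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionVolume {

  /** Returns partition name -> total data bytes, assuming single-level partitions. */
  static Map<String, Long> volumes(Path basePath) throws IOException {
    Map<String, Long> out = new HashMap<>();
    try (Stream<Path> children = Files.list(basePath)) {
      List<Path> partitions = children.filter(Files::isDirectory)
          .filter(p -> !p.getFileName().toString().equals(".hoodie")) // skip the metadata dir
          .collect(Collectors.toList());
      for (Path partition : partitions) {
        try (Stream<Path> files = Files.walk(partition)) {
          long bytes = files.filter(Files::isRegularFile)
              .mapToLong(p -> p.toFile().length())
              .sum();
          out.put(partition.getFileName().toString(), bytes);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    Map<String, Long> v = volumes(Path.of("/tmp/hudi_table"));
    // Only the extremes need to go into the report, per the discussion above.
    v.entrySet().stream().max(Map.Entry.comparingByValue()).ifPresent(e ->
        System.out.println("max partition: " + e.getKey() + " = " + e.getValue() + " bytes"));
    v.entrySet().stream().min(Map.Entry.comparingByValue()).ifPresent(e ->
        System.out.println("min partition: " + e.getKey() + " = " + e.getValue() + " bytes"));
  }
}
```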

> The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1.
> For example, for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.

This way we can collect any Spark runtime information we want, for example all the stages and their corresponding execution times.
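For example, a minimal sketch that pulls stage data for a running application with `java.net.http` (Java 11+); the application id below is a placeholder that would come from the Spark context at runtime:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchStageMetrics {
  public static void main(String[] args) throws Exception {
    // Placeholder app id; at runtime it is available from the SparkContext.
    String appId = "app-20220905120000-0001";
    URI uri = URI.create("http://localhost:4040/api/v1/applications/" + appId + "/stages");
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    // The body is a JSON array of stage objects with fields such as
    // stageId, status, and executorRunTime that the reporter could aggregate.
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```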
Contributor:

Though it may not be related to engine runtime information, could we keep track of the instants produced by the current engine and record them somewhere?

Because in many cases, users run multiple clients without OCC turned on. When an error occurs, the exception message cannot reveal the root cause (i.e., a multi-client issue).

Contributor Author:

Sure. For each commit, this Diagnostic Reporter will generate a report JSON with runtime info at .hoodie/report/instant-time/.
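A minimal sketch of that layout, assuming the report has already been serialized to a JSON string; the file name `report.json` is illustrative, not fixed by the RFC:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteReport {

  /** Writes the serialized report under .hoodie/report/<instant-time>/. */
  static Path writeReport(Path hoodieDir, String instantTime, String reportJson) throws IOException {
    Path reportDir = hoodieDir.resolve("report").resolve(instantTime);
    Files.createDirectories(reportDir);
    // "report.json" is an illustrative file name.
    return Files.writeString(reportDir.resolve("report.json"), reportJson);
  }

  public static void main(String[] args) throws IOException {
    writeReport(Path.of("/tmp/hudi_table/.hoodie"), "20220818134233973",
        "{\"totalNumWrites\":123352}");
  }
}
```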

@zhangyue19921010 (Contributor Author)

Hi @codope and @YuweiXiao, really appreciate your efforts here!
All comments are addressed. PTAL!

@zhangyue19921010 (Contributor Author) commented Sep 15, 2022

> Could we also mention the need to obfuscate enterprise information such as IP addresses, hostnames, buckets, table or column names, and so on?

Emm, this info looks pretty sensitive. I am not sure we can collect it in a public report :<
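If obfuscation were ever adopted instead of simply dropping these fields, one option is to hash sensitive values before they enter the report, so that equal values stay correlatable without leaking raw names. A minimal sketch; the set of sensitive keys is a made-up example, not part of the RFC:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Obfuscator {
  // Hypothetical set of sensitive keys; a real implementation would make this configurable.
  private static final Set<String> SENSITIVE = Set.of("hostname", "ip", "bucket", "table.name");

  static void maskInPlace(Map<String, String> report) throws NoSuchAlgorithmException {
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    for (Map.Entry<String, String> e : report.entrySet()) {
      if (SENSITIVE.contains(e.getKey())) {
        byte[] digest = sha256.digest(e.getValue().getBytes(StandardCharsets.UTF_8));
        // Keep a short stable prefix so equal values stay correlatable across reports.
        e.setValue("masked-" + toHex(digest).substring(0, 12));
      }
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b));
    return sb.toString();
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    Map<String, String> report = new HashMap<>(Map.of("hostname", "node-17.prod", "ip", "10.0.0.1"));
    maskInPlace(report);
    System.out.println(report); // values replaced with masked-<hash prefix>
  }
}
```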

@xushiyan (Member) left a comment

@zhangyue19921010 thanks for taking this up! Some high level thoughts:

  • hudi commit metadata vs hudi metrics: if users enable the diagnostic reporter, should we have a config to include the metrics reporter's data? The metrics system is good at showing trends but hard to cross-check against commit metadata, so regardless of whether the metrics reporter is enabled, the diagnostic reporter can collect metrics and save them to the report dir, just like a csv/json metrics reporter. We can also refine what goes into metrics and what goes into commit metadata, to keep the responsibilities clear and the reporting data organized.
  • consolidate with error table: RFC-20 is a long-pending feature that also aims to assist investigation. The diagnostic reporter should be aware of the error table settings and zip the error table if so configured. Size could be a concern, so configuration can be given to zip the whole table, sample records, or skip the error table completely. It also requires some config to allow masking any fields. Taking a step further, we can make the error table one of the diagnostic reporting features; they have similar storage structures and can be local to the hudi table or global to the whole platform.
  • work with metadata table: you've already mentioned collecting stats by listing the file system. The diagnostic reporter should also be aware of the presence of the metadata table and zip it or extract the relevant data, falling back to file system listing if it is not present.

@zhangyue19921010
Copy link
Contributor Author

> @zhangyue19921010 thanks for taking this up! Some high level thoughts:
>
> • hudi commit metadata vs hudi metrics: if users enable the diagnostic reporter, should we have a config to include the metrics reporter's data? The metrics system is good at showing trends but hard to cross-check against commit metadata, so regardless of whether the metrics reporter is enabled, the diagnostic reporter can collect metrics and save them to the report dir, just like a csv/json metrics reporter. We can also refine what goes into metrics and what goes into commit metadata, to keep the responsibilities clear and the reporting data organized.
> • consolidate with error table: RFC-20 is a long-pending feature that also aims to assist investigation. The diagnostic reporter should be aware of the error table settings and zip the error table if so configured. Size could be a concern, so configuration can be given to zip the whole table, sample records, or skip the error table completely. It also requires some config to allow masking any fields. Taking a step further, we can make the error table one of the diagnostic reporting features; they have similar storage structures and can be local to the hudi table or global to the whole platform.
> • work with metadata table: you've already mentioned collecting stats by listing the file system. The diagnostic reporter should also be aware of the presence of the metadata table and zip it or extract the relevant data, falling back to file system listing if it is not present.

Thanks @xushiyan for your advice!
Will take a deep look and expand this RFC ASAP!

@nsivabalan nsivabalan added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jan 24, 2023
@vinothchandar vinothchandar self-assigned this Feb 16, 2023
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
Labels: priority:critical (production down; pipelines stalled; need help asap) · rfc · size:L (PR with lines of changes in (300, 1000])
Projects: Status: 🏁 Triaged · Status: 🆕 New
8 participants