[RFC-62] Diagnostic Reporter #6600
Conversation
Could we also mention the need to obfuscate enterprise information such as IP addresses, hostnames, bucket names, table or column names, and so on?
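For illustration, such obfuscation could be as simple as a regex pass over the generated report before it leaves the user's environment. This is only a sketch, not part of the RFC; the patterns and placeholder tokens below are hypothetical and would need tuning for real deployments:

```python
import re

# Hypothetical redaction helper: masks IPv4 addresses and host-like
# dotted names (hostnames, bucket endpoints) in a diagnostic report.
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
HOST_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+){2,}\b", re.IGNORECASE)

def redact(text: str) -> str:
    # Replace IPs first so they are not partially matched as hostnames.
    text = IP_RE.sub("<ip>", text)
    text = HOST_RE.sub("<host>", text)
    return text
```

Table and column names would need a different mechanism (e.g. a user-supplied list of sensitive field names), since they are not distinguishable by pattern alone.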
@zhangyue19921010 Thanks for writing this RFC. A diagnostic reporter would be very useful for the community.
I had a broader scope in mind. Essentially, thinking in terms of a system instead of tooling. We could run a web app to view these metrics in a more user-friendly manner. We can split the implementation into multiple phases: in the first phase, implement what you suggested and gather more information based on adoption of this feature; then, in subsequent phases, work on developing the web app. Let me know what you think.
cc @xushiyan @prasannarajaperumal
rfc/rfc-62/rfc-62.md (outdated)

> ## Proposers
> - zhangyue19921010@163.com
nit: just keep it as your github user id i.e. zhangyue19921010
rfc/rfc-62/rfc-62.md (outdated)

> JIRA: https://issues.apache.org/jira/browse/HUDI-4707
>
> > Please keep the status updated in `rfc/README.md`.
can remove this line
rfc/rfc-62/rfc-62.md (outdated)

> ## Background
> As we know, hudi already has its own unique metrics system and metadata framework. This information is very important for hudi job tuning or troubleshooting. For example:
> 1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
Suggested change:

> 1. Hudi will record the complete timeline in the `.hoodie` directory, including active timeline and archive timeline. The timeline acts as an **event log** for the Hudi table using which one can track table snapshots.
rfc/rfc-62/rfc-62.md (outdated)

> 1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
> 2. Hudi metadata table which will records all the partitions, all the data files, etc
Suggested change:

> 2. Hudi metadata table which records partitions, data files, column statistics, etc.
rfc/rfc-62/rfc-62.md (outdated)

> In order to expose hudi table context more efficiently, this RFC propose a Diagnostic Reporter Tool.
Suggested change:

> In order to expose hudi table context more efficiently, this RFC proposes a Diagnostic Reporter System.
rfc/rfc-62/rfc-62.md (outdated)

> We can quickly catch up the data distribution characteristics of the current hudi table through this part, which can be used to determine whether there is a small file problem or a data hotspot problem, etc.
> In addition, through data distribution and sample hudi keys, we can also use this information to help users choose the most appropriate index mode, for example:
+1
> The second part is the `metadata information` related to the last k active commits, sorted by time as a List
I am now thinking that if we are going to zip .hoodie anyway, then adding these commit metadata may be redundant. It could still help save some time, but may not add too much value.
Instead, we could add some information about a table-level aggregate of what actions have been done, in a tabular view. For example:

| Action | Instant | Num Inserts | Num Updates | Total Errors | More Stats |
|---|---|---|---|---|---|
| deltacommit | 20220731224018987 | 100 | 100 | 1 | |
| replacecommit | 20220731223129863 | 500 | 100 | 0 | |
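Such an aggregate could be derived from already-parsed commit metadata along these lines. This is only a sketch; the flat input shape is an assumption for illustration, with field names borrowed from the commit metadata JSON quoted elsewhere in this thread:

```python
# Sketch: build a table-level aggregate of recent actions from parsed
# commit metadata. Each input dict is assumed to carry the action type,
# the instant time, and the usual commit metadata counters.
def aggregate_actions(commits):
    rows = []
    for c in commits:
        rows.append({
            "action": c["action"],
            "instant": c["instant"],
            "num_inserts": c.get("totalNumInserts", 0),
            "num_updates": c.get("totalNumUpdateWrites", 0),
            "total_errors": c.get("totalWriteErrors", 0),
        })
    return rows
```

The resulting rows map directly onto the tabular view suggested above, one row per instant.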
Nice catch!
Retain the ability to zip the entire .hoodie directory, but have it disabled by default. This zip file can provide the most detailed information, sometimes helpful for tough problems.

> we could add some information about table level aggregate

Maybe we can add this info in the `Meta information` part. For now, maybe we can collect:
```json
{
  "configs": {
    "engine_config": {
      "spark.executor.memoryOverhead": "3072"
    },
    "hoodie_config": {
      "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator"
    }
  },
  "commits": [
    {
      "20220818134233973.commit": {
        "totalNumWrites": 123352,
        "totalNumDeletes": 0,
        "totalNumUpdateWrites": 0,
        "totalNumInserts": 123352,
        "totalWriteBytes": 4675371,
        "totalWriteErrors": 0,
        "totalLogRecords": 0,
        "totalLogFilesCompacted": 0,
        "totalLogSizeCompacted": 0,
        "totalUpdatedRecordsCompacted": 0,
        "totalScanTime": 0,
        "totalCreateTime": 21051,
        "totalUpsertTime": 0
      }
    }
  ]
}
```
rfc/rfc-62/rfc-62.md (outdated)

> 2. Hudi metadata table which will records all the partitions, all the data files, etc
> 3. Each commit of hudi records various metadata information and runtime metrics currently written, such as:
Can remove the json blob from background. Just mention what kind of stats commit metadata already has, and maybe also point to the class.
rfc/rfc-62/rfc-62.md (outdated)

> ## Rollout/Adoption Plan
Please fill in these details, especially any new configs and any effect on performance.
rfc/rfc-62/rfc-62.md (outdated)

> ## Implementation
> This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
Should we also allow users to configure the application logs dir and zip the logs?
Emmm, of course we can collect driver logs and all executor logs. The only worry is that the log volume may be too large :<
Left some comments, looking forward to this feature!
rfc/rfc-62/rfc-62.md (outdated)

> ## Implementation
> This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
Maybe provide an option to compress active-timeline-related files only? The table may have been running for a long time, and the whole .hoodie may be huge.
Nice catch. Added in this RFC.
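A minimal sketch of the active-timeline-only option, assuming active instant files live at the top level of `.hoodie` while the archived timeline lives in a subdirectory (a simplification, not the RFC's actual implementation):

```python
import os
import zipfile

# Sketch: zip only active-timeline files instead of the whole .hoodie
# directory. Top-level files in .hoodie are treated as the active
# timeline; subdirectories (e.g. the archived timeline) are skipped.
def zip_active_timeline(hoodie_dir: str, out_zip: str) -> int:
    count = 0
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(hoodie_dir)):
            path = os.path.join(hoodie_dir, name)
            if os.path.isfile(path):  # archived timeline lives in a subdir
                zf.write(path, arcname=name)
                count += 1
    return count
```

A real implementation would instead ask the timeline API which instants are active, so that auxiliary top-level files are not accidentally included.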
rfc/rfc-62/rfc-62.md (outdated)

>     "data_volume":"1000GB",
>     "partition_keys":"year/month/day"
>   },
>   "max_data_volume_per_partition":{
Will we record volume info for all partitions, or just the min/max partitions?
Just min/max partitions are good enough, maybe.
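Recording only the min/max partitions could look like the sketch below, assuming per-partition sizes in bytes have already been collected (the input shape and field names are assumptions for illustration):

```python
# Sketch: keep only the smallest and largest partitions by total data
# volume instead of recording every partition in the report.
def min_max_partitions(volumes: dict):
    if not volumes:
        return None
    lo = min(volumes, key=volumes.get)
    hi = max(volumes, key=volumes.get)
    return {"min_partition": {lo: volumes[lo]},
            "max_partition": {hi: volumes[hi]}}
```

This keeps the report size constant no matter how many partitions the table has, at the cost of hiding the full distribution.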
> The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1.
> For example, for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.

So we can collect any Spark runtime information we want, for example all the stages and their corresponding execution times.
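Querying that REST API could look like the sketch below. The `/applications/<app-id>/stages` endpoint and the `stageId`, `name`, and `executorRunTime` fields come from Spark's documented monitoring API; the helper names, base URL default, and the absence of error handling are assumptions of this sketch:

```python
import json
from urllib.request import urlopen

# Sketch: keep each stage's id, name, and executor run time from the
# stage list returned by Spark's monitoring REST API.
def summarize_stages(stages):
    return [{"stageId": s["stageId"], "name": s["name"],
             "executorRunTime": s.get("executorRunTime", 0)}
            for s in stages]

# Fetch stage data for one application. The default base URL targets a
# running application's UI; use port 18080 for the history server.
def fetch_stage_times(app_id, base_url="http://localhost:4040/api/v1"):
    with urlopen(f"{base_url}/applications/{app_id}/stages") as resp:
        return summarize_stages(json.load(resp))
```

The reporter could persist this summary next to the commit metadata so stage times can be cross-checked against write statistics.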
Though it may not be related to engine runtime information, could we keep track of instants produced by the current engine and record them somewhere?
Because in many cases, users run multiple clients without OCC turned on. When an error occurs, the exception message cannot reveal the root cause (i.e., a multi-client issue).
Sure. For each commit, this Diagnostic Reporter will generate a report json with runtime info at .hoodie/report/instant-time/
Hi @codope and @YuweiXiao, really appreciate your efforts here!

Emm, this info looks pretty sensitive. I am not sure we can collect it in a public report :<
@zhangyue19921010 thanks for taking this up! Some high level thoughts:
- Hudi commit metadata vs hudi metrics: if users enable the diagnostic reporter, should we have a config to include the metrics reporter's data? The metrics system is good at showing trends but hard to cross-check against commit metadata. So regardless of whether the metrics reporter is enabled, the diagnostic reporter can collect metrics and save them to the report dir, just like a csv/json metrics reporter. We can also refine what goes into metrics and what goes into commit metadata, to keep the responsibilities clear and the reporting data organized.
- Consolidate with the error table (RFC-20): this is a long-pending feature that also aims to assist investigation. The diagnostic reporter should be aware of error table settings and zip the error table if so configured. Size could be a concern, so configuration can be given to zip the whole table, sample records, or skip the error table completely. It also requires some config to allow masking any fields. Taking a step further, we can make the error table one of the diagnostic reporting features. They have similar storage structures: local to the hudi table, or global to the whole platform.
- Work with the metadata table: you've already mentioned collecting stats by listing the file system. The diagnostic reporter should also be aware of the presence of the metadata table and zip the table or extract relevant data, falling back to file system listing if it is not present.
Thanks @xushiyan for your advice!
Change Logs
With the development of hudi, more and more users choose hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
Subsequently, some of them may ask the community for help: how can they improve the performance of their hudi ingestion jobs? Why did their hudi jobs fail? And so on.
When dealing with such issues, volunteers in the hudi community typically ask users to provide a list of information, including engine context, job configs, data pattern, Spark UI, etc. Users need to spend extra effort to review their own jobs, collect metrics one by one according to the list, and report back to the volunteers. Moreover, unexpected errors may occur while users are manually collecting this information.
Obviously, there are relatively high communication costs for both volunteers and users.
On the other hand, advanced users also need some way to efficiently understand the characteristics of their hudi tables, including data volume, upsert pattern, and so on.
In order to expose hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
This tool can be turned on as the final stage of an ingestion job, after commit; it will collect common troubleshooting information, including engine runtime information (taking Spark as an example here), and generate a diagnostic report json file.
Alternatively, users can trigger this diagnostic reporter tool via hudi-cli to generate the report json file.
Impact
No impact.
Risk level: none
Contributor's checklist