[RFC-62] Diagnostic Reporter #6600

Open · wants to merge 4 commits into master
Conversation

@zhangyue19921010 (Contributor) commented Sep 5, 2022

Change Logs

As Hudi develops, more and more users choose Hudi to build their own ingestion pipelines to support real-time or batch upsert requirements.
Some of them subsequently ask the community for help, for example: how can they improve the performance of their Hudi ingestion jobs? Why did their Hudi jobs fail?

When dealing with such issues, volunteers in the Hudi community typically ask users to provide a list of information, including engine context, job configs,
data pattern, Spark UI, etc. Users then need to spend extra effort reviewing their own jobs, collecting metrics one by one according to the list, and reporting back to the volunteers.
Moreover, mistakes can easily occur while users manually collect this information.

Obviously, there are relatively high communication costs for both volunteers and users.

On the other hand, advanced users also need a way to efficiently understand the characteristics of their Hudi tables, including data volume, upsert pattern, and so on.

In order to expose Hudi table context more efficiently, this RFC proposes a Diagnostic Reporter Tool.
The tool can be enabled as the final stage of an ingestion job, running after the commit; it collects common troubleshooting information, including engine runtime information (taking Spark as the example here), and generates a diagnostic report JSON file.

Alternatively, users can trigger the diagnostic reporter via hudi-cli to generate the report JSON file.

Impact

no impact

Risk level: none | low | medium | high


Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@parisni (Contributor) commented Sep 7, 2022

Could we also mention the need to obfuscate enterprise information such as IP addresses, hostnames, buckets, table or column names, and so on?

@codope (Member) left a comment

@zhangyue19921010 Thanks for writing this RFC. A diagnostic reporter would be very useful for the community.
I had a broader scope in mind, essentially thinking in terms of a system instead of tooling. We can run a web app to view these metrics in a more user-friendly manner. We can split the implementation into multiple phases: in the first phase, implement what you suggested and gather more information based on adoption of this feature; in subsequent phases, work on developing the web app. Let me know what you think.
cc @xushiyan @prasannarajaperumal


## Proposers

- zhangyue19921010@163.com
Member:

nit: just keep it as your github user id i.e. zhangyue19921010


JIRA: https://issues.apache.org/jira/browse/HUDI-4707

> Please keep the status updated in `rfc/README.md`.
Member:

can remove this line

## Background
As we know, Hudi already has its own unique metrics system and metadata framework. This information is very important for Hudi job tuning and troubleshooting. For example:

1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
Member:

Suggested change
1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.
1. Hudi will record the complete timeline in the `.hoodie` directory, including active timeline and archive timeline. The timeline acts as an **event log** for the Hudi table using which one can track table snapshots.
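To make this concrete, here is a minimal sketch (not part of the RFC) that lists the instants on the active timeline by scanning `.hoodie` with plain `java.nio`; the local base path is an assumption, since a real table may live on HDFS or object storage:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ListActiveTimeline {
  public static void main(String[] args) throws IOException {
    // Assumed local table base path for illustration.
    Path hoodieDir = Paths.get("/tmp/hudi_table/.hoodie");
    try (Stream<Path> files = Files.list(hoodieDir)) {
      files.filter(Files::isRegularFile)
           // Instant files are named <instant-time>.<action>[.<state>],
           // e.g. 20220818134233973.commit
           .filter(p -> p.getFileName().toString().matches("\\d+\\..*"))
           .sorted()
           .forEach(p -> System.out.println(p.getFileName()));
    }
  }
}
```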


1. Hudi will record the complete timeline in the .hoodie directory, including active timeline and archive timeline. From this we can trace the historical state of the hudi job.

2. Hudi metadata table which will records all the partitions, all the data files, etc
Member:

Suggested change
2. Hudi metadata table which will records all the partitions, all the data files, etc
2. Hudi metadata table which records partitions, data files, columns statistics, etc.


In order to expose hudi table context more efficiently, this RFC propose a Diagnostic Reporter Tool.
Member:

Suggested change
In order to expose hudi table context more efficiently, this RFC propose a Diagnostic Reporter Tool.
In order to expose hudi table context more efficiently, this RFC proposes a Diagnostic Reporter System.

Comment on lines 172 to 174
We can quickly grasp the data distribution characteristics of the current hudi table through this part, which can be used to determine whether there is a small-file problem
or a data-hotspot problem, etc.
In addition, through data distribution and sample hudi keys, we can also use this information to help users choose the most appropriate index mode, for example:
Member:

+1

The second part is the `metadata information` related to the last k active commits, sorted by time as a List

Member:

I am now thinking that if we are going to zip .hoodie anyway, adding these commit metadata may be redundant. It could still save some time, but may not add much value.
Instead, we could add some table-level aggregates of what actions have been done, in a tabular view. For example:

| Action | Instant | Num Inserts | Num Updates | Total Errors | More Stats |
| --- | --- | --- | --- | --- | --- |
| deltacommit | 20220731224018987 | 100 | 100 | 1 | |
| replacecommit | 20220731223129863 | 500 | 100 | 0 | |

Contributor Author:

Nice catch!
We can retain the ability to grab the entire .hoodie directory, but have it disabled by default. The zip file provides the most detailed information, which is sometimes helpful for tough problems.

> we could add some information about table level aggregate

Maybe we can add this info in the Meta information part; for now we could collect:

```json
{
  "configs":{
    "engine_config":{
      "spark.executor.memoryOverhead":"3072"
    },
    "hoodie_config":{
      "hoodie.datasource.write.keygenerator.class":"org.apache.hudi.keygen.ComplexKeyGenerator"
    }
  },
  "commits":[
    {
      "20220818134233973.commit":{
        "totalNumWrites":123352,
        "totalNumDeletes":0,
        "totalNumUpdateWrites":0,
        "totalNumInserts":123352,
        "totalWriteBytes":4675371,
        "totalWriteErrors":0,
        "totalLogRecords":0,
        "totalLogFilesCompacted":0,
        "totalLogSizeCompacted":0,
        "totalUpdatedRecordsCompacted":0,
        "totalScanTime":0,
        "totalCreateTime":21051,
        "totalUpsertTime":0
      }
    }
  ]
}
```

2. Hudi metadata table which will records all the partitions, all the data files, etc

3. Each commit of hudi records various metadata information and runtime metrics currently written, such as:
Member:

Can remove the json blob from background. Just mention what kind of stats commit metadata already has, and maybe also point to the class.



## Rollout/Adoption Plan
Member:

Please fill in these details, especially any new configs and any effect on performance.


## Implementation

This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
Member:

Should we also allow users to configure application logs dir and zip the logs?

Contributor Author:

Emmm, of course we can collect driver logs and all executor logs. The only worry is that the log volume may be too large :<

@YuweiXiao (Contributor) left a comment

Left some comments, looking forward to this feature!


## Implementation

This Diagnostic Reporter Tool will go through whole hudi table and generate a report json file which contains all the necessary information. Also this tool will package .hoodie folder as a zip compressed file.
@YuweiXiao (Contributor) commented Sep 8, 2022

Maybe provide an option to compress only the active-timeline-related files? The table may have been running for a long time and the whole .hoodie directory may be huge.

Contributor Author:

Nice catch. Added in this RFC.
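For illustration, a minimal sketch of that packaging step with `java.util.zip`, assuming a local `.hoodie` path and the standard `archived/` sub-directory for the archived timeline (a real implementation would go through Hudi's filesystem abstraction):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipHoodieDir {

  /** Zips the .hoodie directory; when activeOnly is true, the archived timeline is skipped. */
  static void zipHoodie(Path hoodieDir, Path zipFile, boolean activeOnly) throws IOException {
    try (Stream<Path> walk = Files.walk(hoodieDir);
         OutputStream os = Files.newOutputStream(zipFile);
         ZipOutputStream zos = new ZipOutputStream(os)) {
      List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
      for (Path p : files) {
        Path rel = hoodieDir.relativize(p);
        // The archived timeline lives under .hoodie/archived/.
        if (activeOnly && rel.startsWith("archived")) {
          continue;
        }
        zos.putNextEntry(new ZipEntry(rel.toString()));
        Files.copy(p, zos);
        zos.closeEntry();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    zipHoodie(Path.of("/tmp/hudi_table/.hoodie"), Path.of("/tmp/hoodie_diagnostics.zip"), true);
  }
}
```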

"data_volume":"1000GB",
"partition_keys":"year/month/day"
},
"max_data_volume_per_partition":{
Contributor:

Will we record volume info for all partitions, or just the min/max partitions?

Contributor Author:

Just the min/max partitions are probably good enough.
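For illustration, a sketch of how min/max partition volumes could be computed from a plain file listing; it assumes a local base path and single-level partitions (a `year/month/day` layout would need recursion, or the metadata table):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PartitionVolume {

  /** Returns partition name -> total data bytes, assuming single-level partitions. */
  static Map<String, Long> volumes(Path basePath) throws IOException {
    Map<String, Long> out = new HashMap<>();
    try (Stream<Path> children = Files.list(basePath)) {
      List<Path> partitions = children.filter(Files::isDirectory)
          .filter(p -> !p.getFileName().toString().equals(".hoodie")) // skip the metadata dir
          .collect(Collectors.toList());
      for (Path partition : partitions) {
        try (Stream<Path> files = Files.walk(partition)) {
          long bytes = files.filter(Files::isRegularFile)
              .mapToLong(p -> p.toFile().length())
              .sum();
          out.put(partition.getFileName().toString(), bytes);
        }
      }
    }
    return out;
  }

  public static void main(String[] args) throws IOException {
    Map<String, Long> v = volumes(Path.of("/tmp/hudi_table"));
    // Only the extremes need to go into the report, per the discussion above.
    v.entrySet().stream().max(Map.Entry.comparingByValue()).ifPresent(e ->
        System.out.println("max partition: " + e.getKey() + " = " + e.getValue() + " bytes"));
    v.entrySet().stream().min(Map.Entry.comparingByValue()).ifPresent(e ->
        System.out.println("min partition: " + e.getKey() + " = " + e.getValue() + " bytes"));
  }
}
```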

> The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1.
> For example, for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.

This way we can collect any Spark runtime information we want, for example all the stages and their corresponding execution times.
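For example, a minimal sketch that pulls stage data for a running application with `java.net.http` (Java 11+); the application id below is a placeholder that would come from the Spark context at runtime:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchStageMetrics {
  public static void main(String[] args) throws Exception {
    // Placeholder app id; at runtime it is available from the SparkContext.
    String appId = "app-20220905120000-0001";
    URI uri = URI.create("http://localhost:4040/api/v1/applications/" + appId + "/stages");
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    // The body is a JSON array of stage objects with fields such as
    // stageId, status, and executorRunTime that the reporter could aggregate.
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```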
Contributor:

Though it may not be related to engine runtime information, could we keep track of the instants produced by the current engine and record them somewhere?

Because in many cases, users run multiple clients without OCC turned on. When an error occurs, the exception message cannot reveal the root cause (i.e., a multi-client issue).

Contributor Author:

Sure. For each commit, this Diagnostic Reporter will generate a report JSON with runtime info at .hoodie/report/instant-time/.
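A minimal sketch of that layout, assuming the report has already been serialized to a JSON string; the file name `report.json` is illustrative, not fixed by the RFC:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteReport {

  /** Writes the serialized report under .hoodie/report/<instant-time>/. */
  static Path writeReport(Path hoodieDir, String instantTime, String reportJson) throws IOException {
    Path reportDir = hoodieDir.resolve("report").resolve(instantTime);
    Files.createDirectories(reportDir);
    // "report.json" is an illustrative file name.
    return Files.writeString(reportDir.resolve("report.json"), reportJson);
  }

  public static void main(String[] args) throws IOException {
    writeReport(Path.of("/tmp/hudi_table/.hoodie"), "20220818134233973",
        "{\"totalNumWrites\":123352}");
  }
}
```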

@zhangyue19921010 (Contributor Author)

Hi @codope and @YuweiXiao, really appreciate your efforts here!
All comments are addressed. PTAL!

@zhangyue19921010 (Contributor Author) commented Sep 15, 2022

> Could we also mention the need to obfuscate enterprise information such as IP addresses, hostnames, buckets, table or column names, and so on?

Emm, this info looks pretty sensitive. I am not sure we can collect it in a public report :<
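If obfuscation were ever adopted instead of simply dropping these fields, one option is to hash sensitive values before they enter the report, so that equal values stay correlatable without leaking raw names. A minimal sketch; the set of sensitive keys is a made-up example, not part of the RFC:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Obfuscator {
  // Hypothetical set of sensitive keys; a real implementation would make this configurable.
  private static final Set<String> SENSITIVE = Set.of("hostname", "ip", "bucket", "table.name");

  static void maskInPlace(Map<String, String> report) throws NoSuchAlgorithmException {
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    for (Map.Entry<String, String> e : report.entrySet()) {
      if (SENSITIVE.contains(e.getKey())) {
        byte[] digest = sha256.digest(e.getValue().getBytes(StandardCharsets.UTF_8));
        // Keep a short stable prefix so equal values stay correlatable across reports.
        e.setValue("masked-" + toHex(digest).substring(0, 12));
      }
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b));
    return sb.toString();
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    Map<String, String> report = new HashMap<>(Map.of("hostname", "node-17.prod", "ip", "10.0.0.1"));
    maskInPlace(report);
    System.out.println(report); // values replaced with masked-<hash prefix>
  }
}
```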

@xushiyan (Member) left a comment

@zhangyue19921010 thanks for taking this up! Some high level thoughts:

  • hudi commit metadata vs hudi metrics: if users enable the diagnostic reporter, should we have a config to include the metrics reporter's data? The metrics system is good at showing trends but hard to cross-check against commit metadata, so regardless of whether the metrics reporter is enabled, the diagnostic reporter can collect metrics and save them to the report dir, just like a csv/json metrics reporter. We can also refine what goes into metrics and what goes into commit metadata, to keep the responsibilities clear and the reporting data organized.
  • consolidate with error table: RFC-20 is a long-pending feature that also aims to assist investigation. The diagnostic reporter should be aware of the error table settings and zip the error table if so configured. Size could be a concern, so configuration can be given to zip the whole table, sample records, or skip the error table completely. It also requires some config to allow masking any fields. Taking a step further, we can make the error table one of the diagnostic reporting features; they have similar storage structures and can be local to the hudi table or global to the whole platform.
  • work with metadata table: you've already mentioned collecting stats by listing the file system. The diagnostic reporter should also be aware of the presence of the metadata table and zip it or extract the relevant data, falling back to file system listing if it is not present.

@zhangyue19921010
Copy link
Contributor Author

> @zhangyue19921010 thanks for taking this up! Some high level thoughts:
>
> • hudi commit metadata vs hudi metrics: if users enable the diagnostic reporter, should we have a config to include the metrics reporter's data? The metrics system is good at showing trends but hard to cross-check against commit metadata, so regardless of whether the metrics reporter is enabled, the diagnostic reporter can collect metrics and save them to the report dir, just like a csv/json metrics reporter. We can also refine what goes into metrics and what goes into commit metadata, to keep the responsibilities clear and the reporting data organized.
> • consolidate with error table: RFC-20 is a long-pending feature that also aims to assist investigation. The diagnostic reporter should be aware of the error table settings and zip the error table if so configured. Size could be a concern, so configuration can be given to zip the whole table, sample records, or skip the error table completely. It also requires some config to allow masking any fields. Taking a step further, we can make the error table one of the diagnostic reporting features; they have similar storage structures and can be local to the hudi table or global to the whole platform.
> • work with metadata table: you've already mentioned collecting stats by listing the file system. The diagnostic reporter should also be aware of the presence of the metadata table and zip it or extract the relevant data, falling back to file system listing if it is not present.

Thanks @xushiyan for your advice!
Will take a deep look and expand this RFC ASAP!

@nsivabalan nsivabalan added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jan 24, 2023
@vinothchandar vinothchandar self-assigned this Feb 16, 2023
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
Labels: priority:critical (production down; pipelines stalled; need help asap) · rfc · size:L (PR with lines of changes in (300, 1000])
Projects: Status: 🏁 Triaged · Status: 🆕 New
8 participants