Skip to content

feat: [HUDI-9775] Support for show_metadata_table_history with appropriate matching data table info#14303

Open
PavithranRick wants to merge 4 commits intoapache:masterfrom
PavithranRick:pavi-hudi-9775
Open

feat: [HUDI-9775] Support for show_metadata_table_history with appropriate matching data table info#14303
PavithranRick wants to merge 4 commits intoapache:masterfrom
PavithranRick:pavi-hudi-9775

Conversation

@PavithranRick
Copy link
Contributor

Describe the issue this Pull Request addresses

This PR introduces the ShowMetadataTableHistoryProcedure, a new Spark SQL procedure that displays timeline information for both the data table and metadata table side-by-side, enabling analysis of metadata table synchronization and evolution.

Summary and Changelog

  • Unified Timeline View: Shows data table and metadata table timelines in a single result
  • Side-by-Side Format: Metadata table columns first, then data table columns
  • Time Formatting: Proper MM-dd HH:mm:ss formatting for requested/inflight/completed times
  • Archive Support: Includes archived timeline via showArchived parameter
  • Filtering: SQL-based filtering and time range support (startTime / endTime)
  • Error Handling: Graceful handling when metadata table doesn't exist

Impact

  • New public API:
    show_metadata_table_history(table, path, limit, showArchived, filter, startTime, endTime)
  • 11-column output schema showing metadata table and data table timeline information

Risk Level

Low
Verification performed:

  • Comprehensive test coverage: 3 focused test cases
  • Schema validation: all output fields properly typed and validated
  • Error handling: graceful handling of invalid filters, missing tables, and timeline access failures
  • Timeline consistency: proper handling of both active and archived timelines with correct state mapping

Documentation Update

Update Hudi Spark SQL procedures documentation to include show_metadata_table_history usage examples and parameter descriptions.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@PavithranRick PavithranRick marked this pull request as ready for review November 20, 2025 00:02
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 20, 2025
@yihua yihua self-assigned this Nov 25, 2025
@hudi-bot
Copy link
Collaborator

hudi-bot commented Dec 1, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua removed their assignment Dec 13, 2025
@yihua
Copy link
Contributor

yihua commented Dec 13, 2025

@CTTY could you help review this PR?

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @PavithranRick , thanks for the contribution! I've left some comments. Haven't got a chance to check the unit test yet but will do later


private val PARAMETERS = Array[ProcedureParameter](
ProcedureParameter.optional(0, "table", DataTypes.StringType),
ProcedureParameter.optional(1, "path", DataTypes.StringType, ""),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ProcedureParameter.optional(1, "path", DataTypes.StringType, ""),
ProcedureParameter.optional(1, "path", DataTypes.StringType),

path should not have default value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the table and path be required rather than optional?

Comment on lines +39 to +41
ProcedureParameter.optional(4, "filter", DataTypes.StringType, ""),
ProcedureParameter.optional(5, "startTime", DataTypes.StringType, ""),
ProcedureParameter.optional(6, "endTime", DataTypes.StringType, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We do not have to assign default values for these


val filteredEntries = applyFilter(entries, filter, outputType)

val finalEntries = if (startTime.nonEmpty && endTime.nonEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be possible that only startTime is empty, and endTime is not?

Comment on lines +127 to +148
val filteredInstantTimes = if (startTime.nonEmpty && endTime.nonEmpty) {
allInstantTimes.filter { instantTime =>
val timeMatches = (instantTime >= startTime && instantTime <= endTime)
timeMatches
}
} else {
allInstantTimes
}

val sortedInstantTimes = filteredInstantTimes.toSeq.sorted(Ordering[String].reverse)

val entries = sortedInstantTimes.map { instantTime =>
createTimelineEntry(
instantTime,
dataActiveTimeline,
dataArchivedTimeline,
metadataActiveTimeline,
metadataArchivedTimeline,
dataInstantInfoMap,
metadataInstantInfoMap
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is kinda verbose, maybe

Suggested change
val filteredInstantTimes = if (startTime.nonEmpty && endTime.nonEmpty) {
allInstantTimes.filter { instantTime =>
val timeMatches = (instantTime >= startTime && instantTime <= endTime)
timeMatches
}
} else {
allInstantTimes
}
val sortedInstantTimes = filteredInstantTimes.toSeq.sorted(Ordering[String].reverse)
val entries = sortedInstantTimes.map { instantTime =>
createTimelineEntry(
instantTime,
dataActiveTimeline,
dataArchivedTimeline,
metadataActiveTimeline,
metadataArchivedTimeline,
dataInstantInfoMap,
metadataInstantInfoMap
)
}
val entries =
allInstantTimes
.filter(t => startTime.nonEmpty && endTime.nonEmpty || (t >= startTime && t <= endTime))
.toSeq
.sorted(Ordering.String.reverse)
.map { instantTime =>
createTimelineEntry(
instantTime,
dataActiveTimeline,
dataArchivedTimeline,
metadataActiveTimeline,
metadataArchivedTimeline,
dataInstantInfoMap,
metadataInstantInfoMap
)
}

completionTime: String,
modificationTimeMs: Long
) {
def getModificationTime: Long = modificationTimeMs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc, scala case class variables can be directly accessed like instantWithModTime.modificationTime, and we don't need an extra getter

instantMap.getOrElseUpdate(instant.requestedTime(), scala.collection.mutable.Map[HoodieInstant.State, HoodieInstantWithModTime]())
.put(instant.getState, instantWithModTime)
} catch {
case _: Exception => // Skip invalid files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log invalid files?

timestamp -> stateMap.toMap
}.toMap
} catch {
case _: Exception => Map.empty[String, Map[HoodieInstant.State, HoodieInstantWithModTime]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, should we log it?

Comment on lines +266 to +284
val stateMap = instantInfoMap.get(instantTimestamp)
if (stateMap.isDefined) {
val stateInfo = stateMap.get.get(state)
if (stateInfo.isDefined) {
val modificationTime = stateInfo.get.getModificationTime
if (modificationTime > 0) {
val date = new java.util.Date(modificationTime)
val formatter = new java.text.SimpleDateFormat("MM-dd HH:mm:ss")
formatter.format(date)
} else {
instantTimestamp
}
} else {
null
}
} else {
null
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this to

Suggested change
val stateMap = instantInfoMap.get(instantTimestamp)
if (stateMap.isDefined) {
val stateInfo = stateMap.get.get(state)
if (stateInfo.isDefined) {
val modificationTime = stateInfo.get.getModificationTime
if (modificationTime > 0) {
val date = new java.util.Date(modificationTime)
val formatter = new java.text.SimpleDateFormat("MM-dd HH:mm:ss")
formatter.format(date)
} else {
instantTimestamp
}
} else {
null
}
} else {
null
}
}
val formatter = new java.text.SimpleDateFormat("MM-dd HH:mm:ss")
instantInfoMap.get(instantTimestamp)
.flatMap(_.get(state))
.map(_.modificationTimeMs)
.filter(_ > 0)
.map(time => formatter.format(new java.util.Date(time)))
.getOrElse(instantTimestamp)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants