@adamziel adamziel commented May 31, 2025

Add Namespace Support to XMLProcessor

This PR upgrades XMLProcessor to fully support XML Namespaces 1.0. Tags and attributes are now consistently interpreted according to their declared namespaces, fixing compatibility with WordPress WXR files and EPUB metadata.

New method signatures:

public function next_tag( $query_or_namespace = null, $local_name_maybe = null );

public function get_tag_local_name();
public function get_tag_namespace();
public function get_attribute( $namespace, $local_name );
public function get_attribute_names_with_prefix( $full_namespace_prefix, $local_name_prefix );
public function set_attribute( $namespace, $local_name, $value );

Usage comparison:

// Before
$processor->next_tag( 'wp:content' );
$processor->get_attribute( 'wp:post-type' );

// After
// Note that the namespace always comes first. It cannot be skipped as an afterthought. The developer
// must consider it and make an explicit decision about the namespace of every tag and attribute.
$processor->next_tag( 'http://wordpress.org/export/1.2/', 'content' );
$processor->next_tag( '', 'title' );
$processor->next_tag( [ 'http://wordpress.org/export/1.2/', 'content' ] );
$processor->next_tag( [ '*', 'content' ] );
$processor->get_attribute( 'http://wordpress.org/export/1.2/', 'post-type' );

Rationale

The old parser treated tag and attribute names as opaque strings (wp:postmeta, wp:tag, etc.), ignoring that these were syntactic sugar for {namespace}local-name.

This made it impossible to reliably parse WXR files, because the wp: prefix may refer to different namespaces in different parts of the XML document.

After this PR, XML namespaces are first-class citizens in all lookup functions, which allows us to correctly identify the content-bearing tags in the relevant, top-level WXR namespace.
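
To make the prefix problem concrete, here is a minimal sketch that uses PHP's bundled DOM extension rather than XMLProcessor. The document and the http://example.com/other namespace are made up for illustration. It shows two elements that share the qualified name wp:content but live in different namespaces, so only a namespace-aware lookup tells them apart:

```php
<?php
// Illustrative document (not a real WXR file): the wp: prefix is rebound
// to a different, made-up namespace inside <section>.
$xml = <<<XML
<root xmlns:wp="http://wordpress.org/export/1.2/">
  <wp:content>first</wp:content>
  <section xmlns:wp="http://example.com/other">
    <wp:content>second</wp:content>
  </section>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML( $xml );

// Namespace-agnostic lookup: both elements match, and both are spelled
// "wp:content", so the qualified name alone cannot distinguish them.
$all = $doc->getElementsByTagNameNS( '*', 'content' );

// Namespace-aware lookup: only the element in the WXR namespace matches.
$wxr_matches = $doc->getElementsByTagNameNS( 'http://wordpress.org/export/1.2/', 'content' );

foreach ( $all as $element ) {
	echo $element->nodeName . ' => ' . $element->namespaceURI . "\n";
}
```

A string comparison against 'wp:content' would accept both elements. Matching on the (namespace, local name) pair accepts only the first.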

Implementation Details

  • $stack_of_open_elements tracks the hierarchy of XMLElement frames and the namespaces they define and remove.
  • set_attribute($ns, $attr, $value) and get_attribute($ns, $attr) accept the full namespace string as their first argument to force the developer to take it into consideration.
  • next_tag() and matches_breadcrumbs() accept two-tuples {$namespace, $local_tag_name}. Plain string tag names are still accepted, and * wildcards are supported, too.
  • get_breadcrumbs() returns an array of two-tuples {$namespace, $local_tag_name}, e.g. [['', 'root'], ['http://wp.org/export/1.2/', 'post']]
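
The two ideas in the list above, a stack of namespace scopes and matching on two-tuples, can be sketched roughly as follows. This is not the code from this PR: resolve_qualified_name() and tuple_matches() are hypothetical helpers, the stack is simplified to plain arrays of prefix => namespace declarations (one per open element), and treating * as a wildcard in either tuple position is an assumption.

```php
<?php
/**
 * Hypothetical helper, not part of XMLProcessor. Resolves an element's
 * qualified name (e.g. "wp:content") to a [namespace, local name] pair
 * using a stack of per-element namespace declarations.
 */
function resolve_qualified_name( array $scope_stack, string $qualified_name ): array {
	$parts      = explode( ':', $qualified_name, 2 );
	$prefix     = 2 === count( $parts ) ? $parts[0] : '';
	$local_name = 2 === count( $parts ) ? $parts[1] : $parts[0];

	// Walk from the innermost open element outwards; the nearest declaration wins.
	for ( $i = count( $scope_stack ) - 1; $i >= 0; $i-- ) {
		if ( array_key_exists( $prefix, $scope_stack[ $i ] ) ) {
			return array( $scope_stack[ $i ][ $prefix ], $local_name );
		}
	}

	// An unprefixed element with no default namespace in scope has no namespace.
	if ( '' === $prefix ) {
		return array( '', $local_name );
	}

	throw new InvalidArgumentException( "Unbound namespace prefix: {$prefix}" );
}

/**
 * Hypothetical helper, not part of XMLProcessor. Compares a query tuple
 * against a resolved tuple. Assumption: "*" matches anything in either position.
 */
function tuple_matches( array $query, array $actual ): bool {
	list( $query_namespace, $query_local_name ) = $query;
	list( $namespace, $local_name )             = $actual;

	return ( '*' === $query_namespace || $query_namespace === $namespace )
		&& ( '*' === $query_local_name || $query_local_name === $local_name );
}

// One entry per open element; the innermost element rebinds the wp: prefix.
$scope_stack = array(
	array( 'wp' => 'http://wordpress.org/export/1.2/' ),
	array(),
	array( 'wp' => 'http://example.com/other' ),
);

$inner = resolve_qualified_name( $scope_stack, 'wp:content' );

// Closing the innermost element pops its declarations off the stack.
array_pop( $scope_stack );
$outer = resolve_qualified_name( $scope_stack, 'wp:content' );
```

Popping a frame when an element closes is what removes its declarations from scope, so the same qualified name resolves differently depending on where it appears. The sketch covers element names only. In XML Namespaces, unprefixed attributes never pick up the default namespace, so attribute lookup would need a separate rule.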

Testing instructions

Confirm that the CI tests pass (aside from the flaky network-related ones).

@adamziel adamziel marked this pull request as ready for review May 31, 2025 23:28
@adamziel adamziel merged commit 3980ed8 into trunk Jun 2, 2025
20 of 21 checks passed
@github-project-automation github-project-automation bot moved this from Inbox to Done in Playground Board Jun 2, 2025