Auto-strip empty tags #31

m4olivei · 2016-04-14T19:39:25Z

A lot of our content has   in the content. This is kind of a bad habit of editors to introduce vertical space in an article, but I digress. Also, I've found that loading up HTML to pass to the transformer can also introduce empty  tags. For example:

$html = '<p><h2>I'm a real title</h2></p>';
$document = new \DOMDocument();
$output = '<!doctype html><html><head><meta charset="utf-8" /></head><body>' . $html . '</body>';
@$document->loadHTML($output);
$transformer->transform($body, $document);

In the output you get:

<p></p>
<h2>I'm a real title</h2>

Of course, this is bad source HTML, headings are not allowed as children of , so we'll look at fixing our source HTML, but here an empty paragraph is allowed to pass through to Facebook, where there is no reason for it in the context of FBIA.

Is there a way that the SDK could strip out empty  tags? Or is there a Transformer rule that you can write to strip out empty  tags (I've tried this to no avail..)?

The text was updated successfully, but these errors were encountered:

scottrigby · 2016-04-16T16:02:58Z

Regarding  , we've tried several XPath selector variations, to no avail. Here is the general ignore rule setup we're testing:

  $config = array(
    'class' => 'IgnoreRule',
    'selector' => $selector,
  );
  $rule = IgnoreRule::createFrom($config);
  $transformer->addRule($rule);

where $selector has been //*[text()=" "], //*[text()="\u0160"], //*[text()=codepoints-to-string((160))] (which breaks), and other variations using normalize-space(). Is anyone able to find a way to make this work, or would we need to introduce changes in the SDK to allow it?

m4olivei · 2016-04-19T19:08:43Z

OK, so I think I've worked out a nice custom Rule class that I'm happy with that works for this purpose:

<?php

/**
 * @file
 * Contains \EmptyRule
 */

use Facebook\InstantArticles\Transformer\Rules\ConfigurationSelectorRule;

/**
 * Matches empty nodes given a selector.  Here empty nodes are defined as nodes
 * in a DOMDocument that have no children, or have only text children containing
 * whitespace characters.
 */
class EmptyRule extends ConfigurationSelectorRule {

  public function __construct() {
  }

  public static function create() {
    return new EmptyRule();
  }

  public static function createFrom($configuration) {
    return self::create()->withSelector($configuration['selector']);
  }

  public function getContextClass() {
    return array(
      InstantArticle::getClassName(),
      Header::getClassName(),
      Footer::getClassName(),
      TextContainer::getClassName(),
    );
  }

  public function matchesContext($context) {
    return TRUE;
  }

  /**
   * @param \DOMNode $node
   * @return mixed
   */
  public function matchesNode($node) {
    // We're only interested in elements here, testing if they are empty.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
      return FALSE;
    }

    // Limit by the selector passed in the configuration.
    if (!parent::matchesNode($node)) {
      return FALSE;
    }

    // Match iff the node has no children and/or all children are empty text
    // nodes.
    if ($node->hasChildNodes()) {
      /* @var \DOMNode $child */
      foreach ($node->childNodes as $child) {
        if ($child->nodeName !== '#text') {
          return FALSE;
        }
        else {
          // @see https://stackoverflow.com/a/27990195/142145
          $trimmed = trim($child->nodeValue, " \t\n\r\0\x0B\xC2\xA0");
          if (!empty($trimmed)) {
            return FALSE;
          }
        }
      }
    }

    return TRUE;
  }

  public function apply($transformer, $context, $element) {
    return $context;
  }
}

Then assuming you've got a \Facebook\InstantArticles\Transformer\Transformer instance called $transfomer kicking around, add a config rule like so:

$transformer->addRule(
    EmptyRule::createFrom(array(
      'class' => 'EmptyRule',
      'selector' => '//p|//div|//span',
    ))
  );

The method EmptyRule::matchesNode is the key. It'll match any element node, matched by the selector given, that either has no child nodes or has only text child nodes containing only whitespace. This reeeeeally annoying part is the need for this trim statement: trim($child->nodeValue, " \t\n\r\0\x0B\xC2\xA0"), which will match the usual whitespace suspects, but also NBSP.

Working for me so far. If you guys are into this, I'd be up for re-jigging this into a proper pull request.

everton-rosario · 2016-04-20T14:05:02Z

Hey @m4olivei , thanks for digging into this!

This is cool that this solved your cases. I just want to warn you to be aware of this usage.
The Transformer Rules were not designed to have a double pass on same node. So if one specific node is selected by context and the selector, it wont leave this node for another rule.

This way, the empty rule will need a selector that selects only empty #text content.

Im working on a solution right now and will have a pull request soon.
This PR will be:

All Elements will have a method isValid()
toDOMElements will have an if (!$this->isValid()) -> Wont output itself.
Transformer will check all isValid() from elements, if not valid, will add a warning to the transformation

This way we wont:

Output invalid/empty elements
Wont fail silently (because we will add warnings to the Transformer process)

m4olivei · 2016-04-20T14:19:40Z

@everton-rosario yeah your right. It's only working for me b/c I have it as the last rule added to the Transformer, which means its the first rule that gets checked against all nodes.

Sounds like a neat solution, look forward to testing it. Thanks for looking into this as well!

Faiyaz · 2016-04-26T21:19:29Z

Any update on this @everton-rosario.

everton-rosario · 2016-05-03T11:39:34Z

Yes @Faiyaz.
Please follow and also help review this PR: #71

everton-rosario self-assigned this Apr 20, 2016

everton-rosario closed this as completed May 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-strip empty <p /> tags #31

Auto-strip empty <p /> tags #31

m4olivei commented Apr 14, 2016 •

edited

scottrigby commented Apr 16, 2016

m4olivei commented Apr 19, 2016 •

edited

everton-rosario commented Apr 20, 2016

m4olivei commented Apr 20, 2016

Faiyaz commented Apr 26, 2016

everton-rosario commented May 3, 2016

Auto-strip empty <p /> tags #31

Auto-strip empty <p /> tags #31

Comments

m4olivei commented Apr 14, 2016 • edited

scottrigby commented Apr 16, 2016

m4olivei commented Apr 19, 2016 • edited

everton-rosario commented Apr 20, 2016

m4olivei commented Apr 20, 2016

Faiyaz commented Apr 26, 2016

everton-rosario commented May 3, 2016

m4olivei commented Apr 14, 2016 •

edited

m4olivei commented Apr 19, 2016 •

edited