Skip to content
This repository has been archived by the owner on Apr 20, 2023. It is now read-only.

Auto-strip empty <p /> tags #31

Closed
m4olivei opened this issue Apr 14, 2016 · 6 comments
Closed

Auto-strip empty <p /> tags #31

m4olivei opened this issue Apr 14, 2016 · 6 comments
Assignees

Comments

@m4olivei
Copy link
Contributor

m4olivei commented Apr 14, 2016

A lot of our content has <p>&nbsp;</p> in the content. This is kind of a bad habit of editors to introduce vertical space in an article, but I digress. Also, I've found that loading up HTML to pass to the transformer can also introduce empty <p></p> tags. For example:

$html = '<p><h2>I'm a real title</h2></p>';
$document = new \DOMDocument();
$output = '<!doctype html><html><head><meta charset="utf-8" /></head><body>' . $html . '</body>';
@$document->loadHTML($output);
$transformer->transform($body, $document);

In the output you get:

<p></p>
<h2>I'm a real title</h2>

Of course, this is bad source HTML, headings are not allowed as children of <p>, so we'll look at fixing our source HTML, but here an empty paragraph is allowed to pass through to Facebook, where there is no reason for it in the context of FBIA.

Is there a way that the SDK could strip out empty <p> tags? Or is there a Transformer rule that you can write to strip out empty <p> tags (I've tried this to no avail..)?

@scottrigby
Copy link
Contributor

Regarding <p>&nbsp;</p>, we've tried several XPath selector variations, to no avail. Here is the general ignore rule setup we're testing:

  $config = array(
    'class' => 'IgnoreRule',
    'selector' => $selector,
  );
  $rule = IgnoreRule::createFrom($config);
  $transformer->addRule($rule);

where $selector has been //*[text()="&nbsp;"], //*[text()="\u0160"], //*[text()=codepoints-to-string((160))] (which breaks), and other variations using normalize-space(). Is anyone able to find a way to make this work, or would we need to introduce changes in the SDK to allow it?

@m4olivei
Copy link
Contributor Author

m4olivei commented Apr 19, 2016

OK, so I think I've worked out a nice custom Rule class that I'm happy with that works for this purpose:

<?php

/**
 * @file
 * Contains \EmptyRule
 */

use Facebook\InstantArticles\Transformer\Rules\ConfigurationSelectorRule;

/**
 * Matches empty nodes given a selector.  Here empty nodes are defined as nodes
 * in a DOMDocument that have no children, or have only text children containing
 * whitespace characters.
 */
class EmptyRule extends ConfigurationSelectorRule {

  public function __construct() {
  }

  public static function create() {
    return new EmptyRule();
  }

  public static function createFrom($configuration) {
    return self::create()->withSelector($configuration['selector']);
  }

  public function getContextClass() {
    return array(
      InstantArticle::getClassName(),
      Header::getClassName(),
      Footer::getClassName(),
      TextContainer::getClassName(),
    );
  }

  public function matchesContext($context) {
    return TRUE;
  }

  /**
   * @param \DOMNode $node
   * @return mixed
   */
  public function matchesNode($node) {
    // We're only interested in elements here, testing if they are empty.
    if ($node->nodeType !== XML_ELEMENT_NODE) {
      return FALSE;
    }

    // Limit by the selector passed in the configuration.
    if (!parent::matchesNode($node)) {
      return FALSE;
    }

    // Match iff the node has no children and/or all children are empty text
    // nodes.
    if ($node->hasChildNodes()) {
      /* @var \DOMNode $child */
      foreach ($node->childNodes as $child) {
        if ($child->nodeName !== '#text') {
          return FALSE;
        }
        else {
          // @see https://stackoverflow.com/a/27990195/142145
          $trimmed = trim($child->nodeValue, " \t\n\r\0\x0B\xC2\xA0");
          if (!empty($trimmed)) {
            return FALSE;
          }
        }
      }
    }

    return TRUE;
  }

  public function apply($transformer, $context, $element) {
    return $context;
  }
}

Then assuming you've got a \Facebook\InstantArticles\Transformer\Transformer instance called $transfomer kicking around, add a config rule like so:

$transformer->addRule(
    EmptyRule::createFrom(array(
      'class' => 'EmptyRule',
      'selector' => '//p|//div|//span',
    ))
  );

The method EmptyRule::matchesNode is the key. It'll match any element node, matched by the selector given, that either has no child nodes or has only text child nodes containing only whitespace. This reeeeeally annoying part is the need for this trim statement: trim($child->nodeValue, " \t\n\r\0\x0B\xC2\xA0"), which will match the usual whitespace suspects, but also NBSP.

Working for me so far. If you guys are into this, I'd be up for re-jigging this into a proper pull request.

@everton-rosario
Copy link
Contributor

Hey @m4olivei , thanks for digging into this!

This is cool that this solved your cases. I just want to warn you to be aware of this usage.
The Transformer Rules were not designed to have a double pass on same node. So if one specific node is selected by context and the selector, it wont leave this node for another rule.

This way, the empty rule will need a selector that selects only empty #text content.

Im working on a solution right now and will have a pull request soon.
This PR will be:

  • All Elements will have a method isValid()
  • toDOMElements will have an if (!$this->isValid()) -> Wont output itself.
  • Transformer will check all isValid() from elements, if not valid, will add a warning to the transformation

This way we wont:

  • Output invalid/empty elements
  • Wont fail silently (because we will add warnings to the Transformer process)

@m4olivei
Copy link
Contributor Author

@everton-rosario yeah your right. It's only working for me b/c I have it as the last rule added to the Transformer, which means its the first rule that gets checked against all nodes.

Sounds like a neat solution, look forward to testing it. Thanks for looking into this as well!

@everton-rosario everton-rosario self-assigned this Apr 20, 2016
@Faiyaz
Copy link

Faiyaz commented Apr 26, 2016

Any update on this @everton-rosario.

@everton-rosario
Copy link
Contributor

Yes @Faiyaz.
Please follow and also help review this PR: #71

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants