Experiment about searching for text in the DOM, transcending element boundaries
What is this
This is an experiment about how to locate text patterns in a DOM, when the match might span multiple nodes, and we don't know where to look exactly.
The code traverses the (configured part of) the dom, and collects info about which string slice is where. Then it searches for the pattern in the innerText of he (configured part of the) DOM using google-diff-match-patch (see app/lib/fancymatcher/README.txt for more info about this part.)
When a match is found, it maps in back to DOM elements, using the collected information, and it returns info about where to text. (XPath expressions and string indices.)
Optionally, it can also highlight the match in the DOM.
All the DOM analyzing logic is in one CoffeeScript file ( app/lib/domsearcher/domsearcher.coffee )
How to run
- run scripts/web-server.js
- Go to http://localhost:8000/
- Click the buttons, and see what happens.
Or see the live demo.
Unfortunately, there is a limitation about the max length of pattern. The exact value varies by browser and platform, but it's typically 32 or 64. Which means you can't search for patterns longer that that. (The getMaxPatternLength method returns the current limit.)
Currently, this is worked around by searching for the first and the last 32-bit slice of the pattern, and if both is found, and the distance is about right, then it's accepted as match. What goes between the start and the end slice is not analyzed. This might lead to false matches.
When the "display" property of a node is set to "none", it's not displayd, so it's content does not get into it's parent's innerText.
However, there is no easy way to detect this when one is only looking at the DOM. (Like we do.)
So, currently we are trying to detect whether a node (and it's children) is hidden by searching for it's innerText in the innerText of the parent. (If there is no match, it means that this is probably hidden.)
However, this approach can yield "false negatives", if a node is indeed hidden, but the content is the same like that of an other, non-hidden node.
This would not break the results, but would add false parts to the selection.