document.write replaces entire body #437

Closed
AmaanC opened this Issue Nov 20, 2012 · 15 comments

Comments

Projects
None yet
6 participants

AmaanC commented Nov 20, 2012

This is expected behavior if the document.write comes after the body, but Zombie.js replaces the entire body even if it's there before; i.e. in the head tag. Here is a simple example of the Node.JS file:

var Browser = require('zombie');
var browser = new Browser();

browser.visit('localhost:8080', function (){
    console.log(browser.body.outerHTML);
});

And this is the file being server at localhost:

<html>
    <head>
        <script>
            document.write('This should not replace everything!');
        </script>
    </head>
    <body>
        But it does.
    </body>
</html>

When run in the terminal, the logged console output is:

<body>This should not replace everything!</body>

AmmanC, here is how you can "turn off" the interpretation of document.write (at least part of your pages will survive). Not entirely perfect, but I guess you get the point:

HTML = require('./node_modules/zombie/node_modules/jsdom').dom.level3.html
_origHTMLLanguageProcessorJavaScript = HTML.languageProcessors.javascript
HTML.languageProcessors.javascript = (element, code, filename) ->
  code = code.replace /document.write/g, ""
  _origHTMLLanguageProcessorJavaScript element, code, filename

Assaf, until this is fully fixed (which I understand may be more difficult), would it make sense to have another browser option for this? E.g. ignoreDocumentWrite: true?

Thank you!

Owner

assaf commented Nov 21, 2012

document.write should either work correctly, or not work at all. I'd like to see patches for making it work correctly.

Removing it from the script doesn't solve the problem if someone actually wants it to be there, only introduces another cause for head scratching.

AmaanC commented Nov 21, 2012

Agreed. This may work in some cases, but it should be fixed

AmaanC commented Nov 21, 2012

@jirikopsa Where do I put that code?

@AmaanC The code above is CoffeeScript, but I guess you should be able to translate it into JS easily. You should run this code before any HTML parsing happens - and you need to run this only once (e.g. in the beginning of your application/test script)

AmaanC commented Nov 22, 2012

Hmm, I tried that, @jirikopsa, but I think you're developing on a different platform or something, because I get an error saying the JSDOM module couldn't be found. I looked around in node by hitting tab, and it isn't there. Here's the error:

Error: Cannot find module './node_modules/zombie/node_modules/jsdom'

Here's the code after I converted it form CoffeeScript to JavaScript:

HTML = require('./node_modules/zombie/node_modules/jsdom').dom.level3.html;
_origHTMLLanguageProcessorJavaScript = HTML.languageProcessors.javascript;

HTML.languageProcessors.javascript = function (element, code, filename){
  code = code.replace(/document.write/g, "");
  _origHTMLLanguageProcessorJavaScript(element, code, filename);
}

Here is my test JS file:

var Browser = require('../node_modules/zombie');

// A tiny hack: document.write kills elements in JSDOM,
// so we just replace 'document.write' in JS code with empty string;
// Hence expression 
//    document.write('Some stuff here');
// results in expression:
//    ('Some stuff here');
// which just evaluates, but does nothing.
var HTML = require('../node_modules/zombie/node_modules/jsdom').dom.level3.html;
var _origHTMLLanguageProcessorJavaScript = HTML.languageProcessors.javascript
HTML.languageProcessors.javascript = function(element, code, filename){
  var code = code.replace(/document.write/g, '');
  _origHTMLLanguageProcessorJavaScript(element, code, filename);
}

var fs = require('fs');
file = 'test-inline-script.html';
html = fs.readFileSync(file, 'UTF-8');
browser = new Browser();
browser.load(html, function(error){
  console.log(browser.document.innerHTML);
});

And here goes the html:

<html>
<head>
    <title>test case</title>
</head>
<body>
  <p>We got some stuff here</p>
  <script type="text/javascript">
    document.write("This normally replaces all DOM contents.");
  </script>
</body>
</html>

The console.log within load callback should print the whole DOM structure; i.e. unaffected by broken document.write. Note that this really causes that no document.write() statement is actually executed, hence the page you test is modified - i.e. you test something else.

AmaanC commented Nov 25, 2012

Yeah, I got what it should do, @jirikopsa. I was just asking if there's another place where the JSDOM could be, because the path in the require call is incorrect for me. Here's a list of everything in ../node_modules/zombie/node_modules/:

../node_modules/zombie/node_modules/.bin/
../node_modules/zombie/node_modules/.bin/
../node_modules/zombie/node_modules/.bin/
../node_modules/zombie/node_modules/contextify/
../node_modules/zombie/node_modules/contextify/
../node_modules/zombie/node_modules/contextify/
../node_modules/zombie/node_modules/eventsource/
../node_modules/zombie/node_modules/eventsource/
../node_modules/zombie/node_modules/eventsource/
../node_modules/zombie/node_modules/html5/
../node_modules/zombie/node_modules/html5/
../node_modules/zombie/node_modules/html5/
../node_modules/zombie/node_modules/mime/
../node_modules/zombie/node_modules/mime/
../node_modules/zombie/node_modules/mime/
../node_modules/zombie/node_modules/ms/
../node_modules/zombie/node_modules/ms/
../node_modules/zombie/node_modules/ms/
../node_modules/zombie/node_modules/q/
../node_modules/zombie/node_modules/q/
../node_modules/zombie/node_modules/q/
../node_modules/zombie/node_modules/request/
../node_modules/zombie/node_modules/request/
../node_modules/zombie/node_modules/request/
../node_modules/zombie/node_modules/tough-cookie/
../node_modules/zombie/node_modules/tough-cookie/
../node_modules/zombie/node_modules/tough-cookie/
../node_modules/zombie/node_modules/ws
../node_modules/zombie/node_modules/ws
../node_modules/zombie/node_modules/ws
../node_modules/zombie/node_modules/ws/
../node_modules/zombie/node_modules/ws/
../node_modules/zombie/node_modules/ws/

I don't know why so many of them are duplicates, though 😕

Do I have to install jsdom separately or something?

Contributor

fgalassi commented Nov 29, 2012

Removing it from the script doesn't solve the problem if someone actually wants it to be there, only introduces another cause for head scratching.

@assaf I think you're missing the point here. If someone knows there's a document.write in his code, he'll probably figure it out soon and maybe he will be even able to change the code. The real problem here is when someone doesn't know there's a document.write coming from code that is not his (thirdy party widget, scraping a page, etc..). He will get a broken page where another browser will load the page fine, without any further clue and will be unable to directly change the code. That's a show stopper unless he adventures in resource mocking which is not simple, nor obvious at all. While i understand proper document.write takes a lot of effort and is probably overkill, these are definitely zombie.js weaknesses in descending order of importance:

  1. don't log something like "ALERT: document.write non supported" when it's invoked
  2. mention this in a "limitations" section of the documentation
  3. support easy turning off of document.write with a feature in case user doesn't care and is happy to skip it (e.g. ads, etc..)

@AmaanC Your platform really looks odd. jsdom should be here, at least it's in git repository and also when you install with npm, you should get it. Look into package.json. What I have there is "version": "2.0.0-alpha9"

Owner

assaf commented Nov 29, 2012

Resource mocking not simple or obvious: this needs to be fixed. Zombie is a testing framework, resource mocking falls under that, and I agree it's not clear how Zombie should be used to test when 3rd party. Need to fix that in the documentation for 2.0.

document.write support: it is supported to the extent JSDOM allows us to support it, it will work with some cases, not all cases.

Testing vs scraping: I'm the only person working on Zombie, 100% of my time goes to making it an awesome framework for BDD. I may incidentally work for other purposes, but no effort goes into anything other than BDD.

The internal code base is complicated, it's hard for me to work on it, and it limits community contribution. So one of my goals is to keep the complexity as low as possible, and one thing that helps is only catering to a single use case.

However, 2.0 provides a new extension mechanism (Browser.extend and browser as event emitter), which makes it easy to add custom behavior, whether you want to disable or re-implement document.write, or modify JS before it's parsed by Zombie, these are all possible. Most of this is already in master.

So this can be one way of addressing the needs of Web scraping.

On Thursday, November 29, 2012 at 5:06 AM, Federico Galassi wrote:

Removing it from the script doesn't solve the problem if someone actually wants it to be there, only introduces another cause for head scratching.

@assaf (https://github.com/assaf) I think you're missing the point here. If someone knows there's a document.write in his code, he'll probably figure it out soon and maybe he will be even able to change the code. The real problem here is when someone doesn't know there's a document.write coming from code that is not his (thirdy party widget, scraping a page, etc..). He will get an error where another browser will load the page fine, without any further clue and will be unable to directly change the code. That's a show stopper unless he adventures in resource mocking which is not simple, nor obvious at all. While i understand proper document.write takes a lot of effort and is probably overkill, these are definitely zombie.js weaknesses in descending order of importance:

  1. don't log something like "ALERT: document.write non supported" when it's invoked
  2. mention this in a "limitations" section of the documentation
  3. support easy turning off of document.write with a feature in case user doesn't care and is happy to skip it (e.g. ads, etc..)


Reply to this email directly or view it on GitHub (#437 (comment)).

Hi assaf. I may not have said it yet - THANK YOU FOR THIS AWESOME PIECE OF WORK! When you say 100%, do you mean that this project is supported by demandforce?

Definitely thumbs up for low complexity and extendibility as a means to open it for the community contributions.

We're still in evaluation phase, but if we use zombie in production, some contributions might be coming out of our team - tapomat.com.

Owner

assaf commented Dec 1, 2012

That would be great! Thanks.

On Nov 30, 2012, at 4:01 AM, jirikopsa notifications@github.com wrote:

Hi assaf. I may not have said it - THANK YOU FOR THIS AWESOME PIECE OF WORK!

Definitely thumbs up for low complexity and extendibility as a means to open it for the community contributions.

We're still in evaluation phase, but if we use zombie in production, some contributions might be coming out of our team.


Reply to this email directly or view it on GitHub.

gimler commented Jan 22, 2013

ping

That workaround works for me. I'm using Zombie to screenscrape, so it's useful to be able to deal with any given webpage, even those that use document.write.

assaf closed this Sep 25, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment