Casper.page.content adds html tags for non-html content types #178

Open
n1k0 opened this Issue Jul 11, 2012 · 31 comments

Owner

n1k0 commented Jul 11, 2012

We should add a getPageContent() method to casper to access raw body contents when response content-type is not text/html.

@n1k0 n1k0 closed this in 76e0b5c Jul 11, 2012

Owner

n1k0 commented Jul 11, 2012

Sample script demoing usage:

var casper = require('casper').create();

casper.start().then(function() {
    this.open('http://search.twitter.com/search.json?q=casperjs', {
        method: 'get',
        headers: {
            'Accept': 'application/json'
        }
    });
});

casper.then(function() {
    require('utils').dump(JSON.parse(this.getPageContent()));
});

casper.run(function() {
    this.exit();
});

n1k0 added a commit that referenced this issue Jul 14, 2012

fixes #178 - added Casper.getPageContent()
Extracts and returns the raw body contents for the latest retrieved page.

Sample script:

```javascript
var casper = require('casper').create();

casper.start().then(function() {
    this.open('http://search.twitter.com/search.json?q=casperjs', {
        method: 'get',
        headers: {
            'Accept': 'application/json'
        }
    });
});

casper.then(function() {
    require('utils').dump(JSON.parse(this.getPageContent()));
});

casper.run(function() {
    this.exit();
});
```
Contributor

maerten commented Sep 17, 2012

This doesn't seem to work when the JSON is prettified, because (.*) doesn't match multiline strings.
This worked for me though:

([^]*)

Thanks for your efforts on casperjs!
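
To illustrate the point about multiline matching, here is a minimal sketch in plain JavaScript. The `wrapped` string and the `<pre>`-extracting patterns are assumptions made up for this illustration; they are not the expressions used in the CasperJS source. It only shows that `.` stops at line breaks while `[^]` does not.

```javascript
// Hypothetical wrapper, similar to what the engine produces around prettified JSON.
var wrapped = '<html><head></head><body><pre>{\n  "key": "value"\n}</pre></body></html>';

// "." does not match line terminators, so this pattern cannot span the payload.
var dotPattern = /<pre[^>]*>(.*)<\/pre>/;

// "[^]" matches any character, including newlines (JavaScript-specific syntax).
var anyPattern = /<pre[^>]*>([^]*)<\/pre>/;

var dotMatch = wrapped.match(dotPattern);
var anyMatch = wrapped.match(anyPattern);

// Only the second pattern captures the multiline JSON body.
var parsed = JSON.parse(anyMatch[1]);
console.log(dotMatch === null, parsed.key);
```

`[\s\S]*` is an equivalent spelling that also works in regex flavors that reject the empty negated class `[^]`.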

Owner

n1k0 commented Sep 17, 2012

I'm reopening

@n1k0 n1k0 reopened this Sep 17, 2012

Owner

n1k0 commented Oct 14, 2012

Fixed in 4688262 (refs #239)

@n1k0 n1k0 closed this Oct 14, 2012

mwcz commented May 1, 2014

This is happening again. I've tested with current master, and the 1.1.beta* tags. The output from the sample script above is:

SyntaxError: Unable to parse JSON string

The output of this.getPageContent() is:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"errors":[{"message":"The Twitter REST API v1 is no longer active. Please migrate to API v1.1. https://dev.twitter.com/docs/api/1.1/overview.","code":64}]}</pre></body></html>

Here's a lightly modified script that prints out the exact response:

var casper = require('casper').create();

casper.start().then(function() {
    this.open('http://search.twitter.com/search.json?q=casperjs', {
        method: 'get',
        headers: {
            'Accept': 'application/json'
        }
    });
});

casper.then(function() {
    console.log(this.getPageContent());
});

casper.run(function() {
    this.exit();
});

Hmm, the twitter API seems to have changed and can't be used as a source of JSON anymore. But I'm seeing the issue nonetheless, with a working JSON-emitting API. Even with the Accept header set, the response still gets wrapped in HTML tags. From inspecting the code, this is done by webkit and gecko.

Owner

mickaelandrieu commented May 2, 2014

Well, it's a little bit annoying... Sometimes I prefix the URL with view-source: to force WebKit to display the JSON source instead of the decorated HTML preview.

Not a CasperJS bug IMO ;)

mwcz commented May 2, 2014

"view-source:" is a great workaround, thanks!

I know the errant HTML isn't added by CasperJS, but I would contest that it is a bug in CasperJS simply because Casper's documentation of the function is misleading, and the sample code doesn't work.

getPageContent()
....
Retrieves current page contents, dealing with exotic other content types than HTML:
... code sample...
    require('utils').dump(JSON.parse(this.getPageContent()));
....

Original docs

The sample code fails because the result of this.getPageContent() is not valid JSON.

I think Casper's documentation should describe clearly what getPageContent returns, or remove the HTML on the user's behalf (the better solution, IMO). I'm happy to put together a pull request for either clarifying-the-docs or stripping-the-html.

Casper devs, which would you prefer? :)

Owner

mickaelandrieu commented May 3, 2014

ping @n1k0 IMO we shouldn't change the getPageContent API.
But I'm open to arguments: let's debate :)

@mickaelandrieu mickaelandrieu reopened this May 3, 2014

mwcz commented May 7, 2014

Hey @mickaelandrieu, I tried prepending "view-source:" to the URL, but didn't have any luck.

After casper.open(JSON_URL), casper.getPageContent() returns:

<html><head></head><body>[{"key1": "value1", "key2": "value2"}]</body></html>

After casper.open("view-source:" + JSON_URL), casper.getPageContent() returns:

<html><head></head><body></body></html>
Owner

n1k0 commented May 19, 2014

I'll try going with page.plainText and see what happens.

@n1k0 n1k0 closed this in #926 May 19, 2014

n1k0 added a commit that referenced this issue May 19, 2014

Merge pull request #926 from n1k0/bug-178-plain-text-fallback
Refs #178 - plain text version for non-html contents.

I have found a similar issue. If I use the following code:

casper.page.onResourceReceived = function(resource) {
    if (url == resource.url && resource.redirectURL) {
        casper.echo("Get redirect to: " + resource.redirectURL);
    }
};

then subsequent calls to open, thenOpen, etc. ignore the text/plain header and I always get my plain text inside HTML tags, like:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">My Plain text here</pre></body></html>

Just a hint: for now, it is possible to get "really raw" contents using casper.download()

Why is this issue closed? This bug remains unfixed and the "view-source:" hack doesn't work.

This issue does not appear to be fixed. Any updates?

When requesting pure JSON from a known-good URL (one that serves only JSON), I also get the HTML tags mentioned above.

Collaborator

istr commented Jan 2, 2016

This is fixed in master, see the test proving it.

NOTE however, that your server needs to deliver the correct Content-Type header (application/json) -- there is no content sniffing whatsoever.

Thanks for the feedback. I'll have to give it a try.

For what it's worth, I discovered that the problem was intermittent. In cases where the returned JSON was not 'prettified' all worked well. If I set my server's JSON output to be 'pretty' then the HTML tags appeared and caused JSON.parse() to fail. My workaround has been to simply strip out all HTML with a regex before attempting to create a JSON object. This approach is working for me.

Contributor

entrptaher commented May 14, 2016

This is not fixed yet.
It would be great if something like this were built in:

JSON.parse(this.getPageContent().replace(/<\/?[^>]+(>|$)/g, ""));
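
The one-liner above can be wrapped in a small helper, sketched here in plain JavaScript so it can be tried outside CasperJS. The name `parseWrappedJson` is hypothetical, and the sample string is the wrapped output quoted earlier in this thread; in a real script the argument would be the result of `this.getPageContent()`.

```javascript
// Hypothetical helper: strips anything that looks like an HTML tag, then parses the rest.
function parseWrappedJson(content) {
    var text = content.replace(/<\/?[^>]+(>|$)/g, '');
    return JSON.parse(text);
}

// Wrapped output as reported in this thread.
var wrapped = '<html><head></head><body>' +
    '<pre style="word-wrap: break-word; white-space: pre-wrap;">' +
    '{"one": "two","key": "value"}</pre></body></html>';

var data = parseWrappedJson(wrapped);

// Content that was never wrapped passes through unchanged.
var plain = parseWrappedJson('{"one": "two"}');

console.log(data.key, plain.one);
```

This is a workaround under stated assumptions, not a general solution: it mangles payloads whose string values contain `<` or `>`, and it does not decode HTML entities such as `&amp;` that the engine may have introduced while serializing the page.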
Collaborator

istr commented May 14, 2016

@entrptaher Could you explain what you mean with your code snippet and/or a test case that demonstrates this still does not work? Did you verify that your server returns the correct Content-Type header? (see above)

Collaborator

istr commented May 14, 2016

@dotmechanize

If I set my server's JSON output to be 'pretty' then the HTML tags appeared and caused JSON.parse() to fail.

Are you sure that setting the "prettify" option did not change the response's Content-Type header?

Contributor

entrptaher commented May 14, 2016

UPDATED: Added CODE 5

CODE 1 (Works as expected)

var casper = require('casper')
    .create();

casper.start();
casper.open('http://localhost/');
casper.echo("...");
casper.then(function dumpHeaders() {
    this.currentResponse.headers.forEach(function (header) {
        console.log(header.name + ': ' + header.value);
    });
});
casper.then(function getthecontents() {
    console.log(this.getPageContent());
});
casper.run(function () {
    this.exit();
});

Output:

...
X-Powered-By: Express
Content-Type: application/json; charset=utf-8
Content-Length: 2
ETag: W/"2-mZFLkyvTelC5g8XnyQrpOw"
Date: Sat, 14 May 2016 11:25:19 GMT
Connection: keep-alive
{"one": "two","key": "value"}

CODE 2 (Works as expected)

var port = 6100;
var casper = require("casper")
    .create();
var jsonrules = 'http://localhost/';
casper.start(jsonrules);
casper.run(function () {
    console.log('...');
});
require("webserver")
    .create()
    .listen(port, function (request, response) {
        response.statusCode = 200;
        response.write(casper.getPageContent());
        response.close();
    });
console.log("listening on port", port);

Output

{"one": "two","key": "value"}

CODE 3 (Server times out, used code 4 as workaround)

var port = 6100;
var casper = require("casper")
    .create();
var jsonrules = 'http://localhost/';
casper.start(jsonrules);
casper.run(function () {
    console.log('...');
});
require("webserver")
    .create()
    .listen(port, function (request, response) {
        casper.open(jsonrules)
            .then(function () {
                casper.then(function dumpHeaders() {
                    this.currentResponse.headers.forEach(function (header) {
                        console.log(header.name + ': ' + header.value);
                    });
                });
                this.then(function () {
                    this.echo(this.getPageContent());
                });
            })
            .then(function () {
                response.statusCode = 200;
                response.write('OK');
                response.close();
            });
        /*
        // Doesn't work either
        casper.open(jsonrules)
            .then(function () {
                response.statusCode = 200;
                response.write(this.getPageContent());
                response.close();
            });
        */
    });
console.log("listening on port", port);

CODE 4 (Doesn't work)

var port = 6100;
var casper = require("casper")
    .create();
var jsonrules = 'http://localhost/';
casper.start(jsonrules);
casper.run(function () {
    console.log('...');
});
require("webserver")
    .create()
    .listen(port, function (request, response) {
        casper.steps = [];
        casper.step = 0;
        if(request.url.indexOf("/openseseme") !== -1) {
            casper.open(jsonrules)
                .then(function () {
                    casper.then(function dumpHeaders() {
                        this.currentResponse.headers.forEach(function (header) {
                            console.log(header.name + ': ' + header.value);
                        });
                    });
                    this.then(function () {
                        this.echo(this.getPageContent());
                    });
                })
                .then(function () {
                    response.statusCode = 200;
                    response.write('OK');
                    response.close();
                })
        }
        casper.run();
    })
console.log("listening on port", port);

Output:

listening on port 6100
...
X-Powered-By: Express
ETag: W/"2-mZFLkyvTelC5g8XnyQrpOw"
Date: Sat, 14 May 2016 11:22:10 GMT
Connection: keep-alive
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"one": "two","key": "value"}</pre></body></html>

CODE 5 (As per docs, Works as expected, output is same as the output of CODE 1)

var port = 6100;
var casper = require("casper")
    .create();
var jsonrules = 'http://localhost/';
require("webserver")
    .create()
    .listen(port, function (request, response) {
        /*casper.steps = [];
        casper.step = 0;*/
        if(request.url.indexOf("/checkstatus") !== -1) {
            casper.start(jsonrules)
                .then(function () {
                    casper.then(function dumpHeaders() {
                        this.currentResponse.headers.forEach(function (header) {
                            console.log(header.name + ': ' + header.value);
                        });
                    });
                    this.then(function () {
                        this.echo(this.getPageContent());
                    });
                })
                .then(function () {
                    response.statusCode = 200;
                    response.write('OK');
                    response.close();
                })
            casper.run();
        }
    })
console.log("listening on port", port);
Collaborator

istr commented May 14, 2016

##Output:

listening on port 6100
...
X-Powered-By: Express
ETag: W/"2-mZFLkyvTelC5g8XnyQrpOw"
Date: Sat, 14 May 2016 11:22:10 GMT
Connection: keep-alive

Yes, exactly like expected. Your express server does not include the correct Content-Type header here.

When the server does not provide a Content-Type, the implied default is text/html.

No content type sniffing is performed on CasperJS side. This is up to you. Fix your server or sniff the content type on your own.

EDIT: To clarify this: getPageContent uses page.framePlainText only if a Content-Type other than text/html is explicitly given in the server response, see https://github.com/casperjs/casperjs/blob/master/modules/casper.js#L969.

If no Content-Type is explicitly given or the Content-Type is text/html, it returns page.frameContent. This is being wrapped in <html></html> by the underlying engine (PhantomJS/Webkit and SlimerJS/Gecko), not by CasperJS.
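
The rule described in the two paragraphs above can be sketched as a plain function. This is a simplified illustration, not the actual CasperJS source: `selectPageContent` and `fakePage` are hypothetical names, and the real implementation lives in modules/casper.js at the link given above.

```javascript
// Simplified sketch of the selection rule; NOT the actual CasperJS implementation.
function selectPageContent(contentType, page) {
    // Missing header or text/html: return the DOM serialized by the engine,
    // which is where the <html>...</html> wrapper comes from.
    if (!contentType || contentType.indexOf('text/html') !== -1) {
        return page.frameContent;
    }
    // Any other explicit Content-Type: return the plain text of the frame.
    return page.framePlainText;
}

// Stand-in for a PhantomJS page object, exposing only the two properties used here.
var fakePage = {
    frameContent: '<html><head></head><body><pre>{"one":"two"}</pre></body></html>',
    framePlainText: '{"one":"two"}'
};

var fromJson = selectPageContent('application/json; charset=utf-8', fakePage);
var fromMissing = selectPageContent(undefined, fakePage);
var fromHtml = selectPageContent('text/html', fakePage);

console.log(fromJson);
console.log(fromMissing);
```

With the header present the JSON comes back untouched; with the header missing, as in the failing output above, the wrapped markup is returned and `JSON.parse()` fails on it.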

Contributor

entrptaher commented May 14, 2016

This is to let you know that all of those scripts had the same target URL. And the page does have Application/json, otherwise the first and second code wouldn't work as expected, nor would they print what they do.
Look at the output of Code 1 and Code 4. Somehow the code is missing the JSON part.

Collaborator

istr commented May 14, 2016

As the current behavior seems to be counter-intuitive and was the reason for multiple issues raised, I would like to discuss if we should change the behavior, provide Casper.prototype.getPagePlainText along with Casper.prototype.getPageContent and clearly document that getPageContent will always return HTML, like the engine does. @BIGjuevos @mickaelandrieu ?

EDIT: for reference:
https://github.com/casperjs/casperjs/blob/master/modules/casper.js#L969
http://phantomjs.org/api/webpage/property/frame-content.html
http://phantomjs.org/api/webpage/property/frame-plain-text.html

@istr istr reopened this May 14, 2016

Collaborator

istr commented May 14, 2016

@entrptaher

And the page does have Application/json

But not in the failing example. Please try to replace the call to this.getPageContent() with this.page.framePlainText in your example and see if this works for you. If so, it is definitely the missing Content-Type in the failing case.

EDIT: maybe hit by something like this: http://stackoverflow.com/questions/34841855/express-missing-response-content-type-on-second-load

Contributor

entrptaher commented May 14, 2016

The issue might be related to this part of code where I used casper.start outside the webserver. The code didn't work and I had to do a quick workaround by using,

casper.steps = [];
casper.step = 0;

The getPageContent() fails there.

EDIT: No, here Express.js shows the content type all the time. I tried code 1 after code 2, code 5 after code 3, etc., and vice versa on the same page. I tried with http://echo.jsontest.com/key/value/one/two too, same problem.

EDIT 2: YES, just checked: Express.js really removes the Content-Type header on the second load from the same host (304 Not Modified). But Firefox/Chrome still say the page is a JSON page. It's true that I'm visiting the same page twice in CODE 4. I'll see what happens if I do that only once.

EDIT 3: Yes, Express.js was the culprit. Wow, thanks for the link, I never even thought about it because I got the problem on the jsontest site too (only sometimes, not every time). So I used this.page.framePlainText and it worked on any kind of site.

Looks like it copies content from elements view and not the sources view.

BUT here's what I really think,
Firefox/Chrome/Any other browser renders the page as JSON, no matter how many times I reload them, if casperjs uses phantomjs and phantomjs is a browser, then casperjs (and phantomjs) should provide some solution for those who gets into such silly problems. And, just now the solution was here, Please continue the good work. I'll try to contribute as much as possible.

Contributor

entrptaher commented May 14, 2016

Could only these four lines do the trick?

Casper.prototype.getPlainText = function getPlainText() {
    "use strict";
    this.checkStarted();
    return this.page.framePlainText;
};

Yup, it worked. getPlainText()


Owner

mickaelandrieu commented May 14, 2016

@entrptaher go for a pull request with a test, thank you for your researches :)

Collaborator

istr commented May 14, 2016

Firefox/Chrome/Any other browser renders the page as JSON, no matter how many times I reload them, if casperjs uses phantomjs and phantomjs is a browser, then casperjs (and phantomjs) should provide some solution for those who gets into such silly problems.

Yes, you're kind of right. PhantomJS is a browser, but with its own deviations. Even though PhantomJS, Chrome, and Safari all build on (different versions of) WebKit, they behave wildly differently in several edge/corner cases. Firefox / SlimerJS (Gecko) is a completely different story.

I have a different idea, though, that might be more helpful while testing: check if the response contains a Content-Type header and emit a warning that clearly states what will be the consequences. #1584
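
As a rough illustration of that idea, the check could look like the sketch below. The helper names `getHeader` and `contentTypeWarning` and the warning text are assumptions for this example, not what #1584 implements. The header arrays mirror the `{name, value}` entries that `currentResponse.headers` yields in the scripts above.

```javascript
// Hypothetical helper: case-insensitive lookup in a [{name, value}] header list.
function getHeader(headers, name) {
    var wanted = name.toLowerCase();
    for (var i = 0; i < headers.length; i++) {
        if (headers[i].name.toLowerCase() === wanted) {
            return headers[i].value;
        }
    }
    return null;
}

// Hypothetical check: returns a warning string, or null when the header is present.
function contentTypeWarning(headers) {
    if (getHeader(headers, 'Content-Type') === null) {
        return 'No Content-Type header in response: content will be handled as text/html';
    }
    return null;
}

// Headers taken from the CODE 1 output (with Content-Type) ...
var jsonHeaders = [
    { name: 'X-Powered-By', value: 'Express' },
    { name: 'Content-Type', value: 'application/json; charset=utf-8' }
];
// ... and from the CODE 4 output (without Content-Type).
var bareHeaders = [
    { name: 'X-Powered-By', value: 'Express' },
    { name: 'Connection', value: 'keep-alive' }
];

console.log(contentTypeWarning(jsonHeaders));
console.log(contentTypeWarning(bareHeaders));
```

In a CasperJS script the same function could be called with `this.currentResponse.headers` inside a step, right before reading the page content.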

Contributor

entrptaher commented May 14, 2016

Now, an awesome discovery~~ -_-
fetchText('html') and getPlainText() are the same.
Their tests show the same results. Maybe we don't need getPlainText(); instead we could add an example of using fetchText('html') for non-HTML types, because both basically return what the browser is showing.

And, add a note for getPageContent #178 (comment)

casper.test.begin('fetchText() handles non-html pages', 1, function(test) {
    casper.start().then(function() {
        this.setContent('{"bar":"foo"}');
        test.assertEquals(JSON.parse(this.fetchText('html'))['bar'], "foo",
            'Casper.fetchText() handles non-html pages');
    });
    casper.run(function() {
        test.done();
    });
});

Conclusion:

We were chasing a mirage. The browser automatically adds the HTML tags; Casper doesn't do that. Here, getting the raw data basically means getting the HTML that was generated by the browser. So we already had a nice solution, fetchText('html'), for raw page data.
Can anyone verify it with other types of data, like text/plain?

Collaborator

istr commented May 14, 2016

The browser automatically adds html tags, casper doesn't do that

Yes, like stated in my comment above:
#178 (comment) (last sentence of the EDIT).

a nice solution called fetchText('html') for raw page data

I guess the point is, that this solution is not so very nice (it is not obvious at all) and not well documented. And as far as I can tell (need to re-check, I might be wrong here) there is a slight difference:

  • fetchText('html') fetches the serialized and concatenated DOM text nodes while
  • this.page.framePlainText returns the actual (unprocessed) response text

with the effect that the former should have sanitized / normalized entities while the latter has not (in case it is an HTML document).

So I still think we would be better off with the "four lines [that] could do the trick", because these would be obvious, even for newcomers to CasperJS.

Contributor

entrptaher commented May 14, 2016

// Assumes the proposed Casper.prototype.getPlainText() from the comment above is available.
casper.test.begin('getPlainText() and fetchText("html") handle texts with no CONTENT TYPE header', 6, function(test) {
    casper.start('https://raw.githubusercontent.com/casperjs/casperjs/master/LICENSE.md').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'both read Markdown files as plain text');
    });
    casper.thenOpen('https://raw.githubusercontent.com/casperjs/casperjs/master/Makefile').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'both read Makefile files as plain text');
    });
    casper.thenOpen('http://docs.casperjs.org/en/latest/index.html').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'both show only the visible part of the page as plain text');
    });
    casper.thenOpen('http://docs.casperjs.org/en/latest/license.html').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'both show only the visible part of the page as plain text');
    });
    // The same JSON URL is opened twice, as in the original test.
    casper.thenOpen('http://md5.jsontest.com/?text=example_text').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'JSON files give the same output too');
    });
    casper.thenOpen('http://md5.jsontest.com/?text=example_text').then(function() {
        test.assert(this.getPlainText() === this.fetchText('html'), 'JSON files give the same output too');
    });
    casper.run(function() {
        test.done();
    });
});