Truly fix #128: Pages with no 'Content-Type' header fail #135

ghost · 2011-08-23T08:47:00Z

These patches have been tested and verified to work with unsupported content. :)

Please check the changes and let me know if anything needs to be fixed/altered.

Concept by Roejames12. C++ patch by Niek (with edits by Roejames12).

ghost · 2011-08-23T09:08:28Z

Hold off on merging for now. I need to check out an issue: the current approach causes a parent frame's HTML content to be overwritten with a child frame's content when the child frame has no 'Content-Type' header.

Concepts and ideas by Niek and me. C++ patch by Niek.

ghost · 2011-08-24T21:15:39Z

It is now completely fixed, tested, and working. :)

ariya · 2011-08-25T03:56:29Z

I'm still unsure about

frame.setHtml(data, reply.url())

Why is it necessary to reset the frame content?

ghost · 2011-08-25T04:59:04Z

Using this script:

var page = new WebPage(),
    page2 = new WebPage();

// this URL has no Content-Type
page.open('http://dubbelboer.com:9090/', function(status) {
    console.log('No Content-Type returns ' + status + ' with the following content:');
    console.log(page.content);
});

// this URL has a Content-Type
page2.open('http://dubbelboer.com:9090/html', function(status) {
    console.log('Content-Type returns ' + status + ' with the following content:');
    console.log(page2.content);
});

We will get these results with frame.setHtml:

No Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>
Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>

Now, replacing the setHtml call with a filler self.m_webPage.loadFinished.emit(True) returns:

No Content-Type returns success with the following content:
<html><head></head><body></body></html>
Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>

So you can see that unless we get the reply's data and use the setHtml call, the frame will contain nothing beyond what a standard about:blank page has. This makes complete sense, since I wouldn't expect Qt to set the frame's HTML to the reply's data when it's unsupported content (it seems possible that QtWebKit isn't autodetecting the document type, so it doesn't know what to do when 'Content-Type' is missing).

From the Qt Documentation:

unsupportedContent
This signal is emitted when WebKit cannot handle a link the user navigated to ...
That would imply that if Qt emitted this signal, it couldn't handle the content (hence unsupportedContent). So if the content is unsupported, Qt shouldn't be setting the frame's HTML content; otherwise we could end up with nasty stuff in the web page (think of a binary file)!

ariya · 2011-08-25T05:51:00Z

What happens if the "unsupported" content is not HTML/XML?

ghost · 2011-08-25T05:52:59Z

I actually tested that out just now. What happens is that the reply starts getting downloaded (and eventually we will call setHtml on the data!).

So right now I'm looking for a way to detect whether the content is an HTML document, so we can ignore everything else. I figure this is also a good time to implement a way to download files.

I'm thinking of a signal such as page.onDownloadRequest, with a callback where you can open a file (write+binary mode is best, of course) and save the data to it. It would also need a way to approve or reject the download, which means pausing the reply until the user has indicated what to do.

We don't have to do that part now, but it would be best to do it along with this to complete everything. If not, we still need a way to reject everything that isn't an HTML document.

Edit: I've updated issue 128 with more specifics. Also note that we can use readyRead() to detect the document's MIME type.

erikdubbelboer · 2011-08-25T13:57:48Z

You should be able to do some content sniffing inside the downloadProgress signal, aborting the reply if you haven't found any HTML tags in the first n bytes.
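
The sniffing idea above could be sketched roughly as follows. This is only an illustration in plain Python 3, not the code from the patch: the function name looks_like_html, the tag list, and the default of 512 bytes are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small list of tags; a real sniffer checks many more.
_HTML_MARKERS = re.compile(
    rb"<\s*(!doctype\s+html|html|head|body|title|script|iframe|div|p|a|img)\b",
    re.IGNORECASE,
)


def looks_like_html(data: bytes, n: int = 512) -> bool:
    """Return True if an HTML-looking tag appears within the first n bytes."""
    return _HTML_MARKERS.search(data[:n]) is not None
```

A caller would abort the reply when this returns False for the buffered bytes, and let the content through otherwise.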

I tried writing it myself but I ended up with some strange infinite request loop.

I added an iframe without any html tags to http://dubbelboer.com:9090/ to test it.

ghost · 2011-08-31T06:09:37Z

Preliminary MIME sniffing is now complete in PyPhantomJS. Check issue 128 for more.

ghost · 2011-09-04T12:41:38Z

Python implementation is ready. I'm just waiting on the C++ one at the moment (I can imagine it will take a while, as this code is complex). Default types allowed through are: text, html, xml, and images. Everything else is aborted and a failed signal is sent to the page.
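
As a rough illustration of that allow-list, here is a minimal Python 3 sketch. The names ALLOWED_PREFIXES, ALLOWED_EXACT, and is_allowed are hypothetical, and the exact MIME strings are my assumption of what "text, html, xml, and images" maps to; the real PyPhantomJS code may differ.

```python
# Assumed mapping of "text, html, xml, and images" onto MIME types.
ALLOWED_PREFIXES = ("text/", "image/")
ALLOWED_EXACT = {"application/xml", "application/xhtml+xml"}


def is_allowed(mime_type: str) -> bool:
    """Return True if a sniffed MIME type should be let through to the page."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base.startswith(ALLOWED_PREFIXES) or base in ALLOWED_EXACT
```

Anything for which is_allowed returns False would be aborted, with the page receiving a failed status.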

erikdubbelboer · 2011-09-04T12:58:16Z

I have fixed everything in C except for the whole mime sniffer. Found some bugs and made some improvements on it which will need to be done to the Python implementation as well.

As you can see I have modified http://dubbelboer.com:9090 for some extra testing.

I'll upload a patch tomorrow.

ghost · 2011-09-04T13:05:21Z

> As you can see I have modified http://dubbelboer.com:9090 for some extra testing.

Ya, pretty cool. My first test on the new page passed flawlessly. :)

> Found some bugs and made some improvements on it which will need to be done to the Python implementation as well.

Sweet, can't wait to see! :)

erikdubbelboer · 2011-09-04T16:34:39Z

> Ya, pretty cool. My first test on the new page passed flawlessly. :)

Actually it's requesting all the content 2 or 3 times because the body of the main frame gets reset every time a request finishes. You can fix this by removing the reply from m_replies after the setContent. I noticed this in the C version as well and have fixed it.
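
The fix described above amounts to handling each reply exactly once. A minimal Python 3 sketch of that bookkeeping follows; ReplyTracker and its methods are hypothetical names used for illustration, standing in for the m_replies container rather than reproducing it.

```python
class ReplyTracker:
    """Tracks pending replies so each one resets frame content at most once."""

    def __init__(self):
        self._pending = set()

    def track(self, reply_id):
        # Called when a reply with unsupported content starts downloading.
        self._pending.add(reply_id)

    def finish(self, reply_id) -> bool:
        """Return True only the first time a tracked reply finishes."""
        if reply_id in self._pending:
            # Forget the reply so later finished signals do not reset the frame again.
            self._pending.discard(reply_id)
            return True
        return False
```

The caller would set the frame content only when finish returns True.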

I just finished the mime sniffer in C, need to fix one thing and then I'll send a patch tomorrow.

By the way, I also noticed that images in img tags without Content-Type headers don't fire unsupportedContent but just work. So the best way to test the sniffer for images is to load http://dubbelboer.com:9094/ only.

erikdubbelboer · 2011-09-04T17:05:41Z

I modified the test page again (try looking at it in a normal browser first). I don't think only looking at the first readyRead will always work, since it might not contain enough bytes to sniff. I already fixed this in the C version by waiting for 512 bytes or the end of the request (if it's less than 512).
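
The "wait for 512 bytes or the end of the request" approach can be sketched in plain Python 3 as below. SniffBuffer and feed are hypothetical names; in the real code the chunks would arrive through the reply's readyRead signal, which is not modelled here.

```python
class SniffBuffer:
    """Collects chunks until there is enough data to sniff, then hands it over once."""

    def __init__(self, needed: int = 512):
        self.needed = needed
        self.data = bytearray()
        self.decided = False

    def feed(self, chunk: bytes, finished: bool = False):
        """Return the bytes to sniff once ready, otherwise None."""
        if self.decided:
            # Already handed data to the sniffer; ignore later chunks.
            return None
        self.data += chunk
        if len(self.data) >= self.needed or finished:
            self.decided = True
            return bytes(self.data[: self.needed])
        return None
```

Because feed returns data only once, this also covers the case where the signal fires repeatedly for the same reply.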

ghost · 2011-09-05T00:47:38Z

> Actually it's requesting all the content 2 or 3 times because the body of the main frame gets reset every time a request finishes. You can fix this by removing the reply from m_replies after the setContent. I noticed this in the C version as well and have fixed it.

Nice catch. I also overlooked that readyRead is fired more than once, so we need to cancel subsequent calls to it somehow (you may have already done that).

> I don't think only looking at the first readyRead will always work, since it might not contain enough bytes to sniff. I already fixed this in the C version by waiting for 512 bytes or the end of the request (if it's less than 512).

Try 1024 bytes to be more lenient. I know that 512 bytes works, but in some very rare cases (sniffing for binary content) we needed 1024 bytes to catch it. I already use 1024 bytes in the MimeSniffer, but as you said, waiting for the right amount or the end of the request is a much better approach. Looking forward to seeing your work. :)
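
For the binary case mentioned above, a check along these lines is one possibility. It is a Python 3 sketch under my own assumptions: looks_binary is a hypothetical name, and the set of "binary" control bytes is chosen for the example rather than taken from the MimeSniffer code.

```python
# Control bytes that rarely appear in text; tab, LF, FF, CR and ESC are excluded.
_BINARY_BYTES = bytes(set(range(0x00, 0x20)) - {0x09, 0x0A, 0x0C, 0x0D, 0x1B})


def looks_binary(data: bytes, n: int = 1024) -> bool:
    """Return True if any binary-looking byte occurs in the first n bytes."""
    return any(byte in _BINARY_BYTES for byte in data[:n])
```

A larger n catches files whose first binary byte appears late, at the cost of buffering more data before deciding.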

erikdubbelboer · 2011-09-05T09:17:51Z

Here are my diffs and files:
http://dubbelboer.com/phantomjs/mimesniffer.cpp
http://dubbelboer.com/phantomjs/mimesniffer.h
http://dubbelboer.com/phantomjs/networkreplyproxy.cpp.diff
http://dubbelboer.com/phantomjs/networkreplyproxy.h.diff
http://dubbelboer.com/phantomjs/phantomjs.pro.diff
http://dubbelboer.com/phantomjs/webpage.cpp.diff
http://dubbelboer.com/phantomjs/webpage.h.diff

I have also added SWF files to the mime sniffer.

I think 512 bytes should be enough, seeing that other browsers seem to use 512 as well (except for IE, which only uses 256).

I think the code I have written could probably be improved by moving a lot of it to networkreplyproxy, so that you don't have to loop over all the replies all the time. Another way to improve it might be to use sender() to get the reply that sent the signal. This is my first time working with Qt, so I'm not sure how to do this and I hope someone else will.

ghost · 2011-09-05T23:56:48Z

Looks good. I'll try to look it all over when I have more time, and give you pointers if I see anything.

ariya · 2011-09-06T22:39:40Z

Shall I merge this or are you going to create another pull request with the complete changes?

ghost · 2011-09-06T22:40:53Z

I'm going to continue to add commits until I'm done, then you can merge it. I still have the changes to do, and also possibly changing a few things. (Unless you want me to close this pull request, and squash all the commits)

Edit: You know, squashing everything together would be far better anyways. I'll close this, finish my work, squash it all, then open a new request. Any more discussion can take place on issue 128.

Support pages which have no 'Content-Type' header. Fixes #128

90555fc

Concept by Roejames12. C++ patch by Niek (with edits by Roejames12).

Completely finish handling unsupportedContent

56fefd9

Concepts and ideas by Niek and me. C++ patch by Niek.

Check that Content-Disposition isn't 'attachment'

f7ce96a

Python: Use MIME sniffing to filter what content to allow

af90904

ghost closed this Sep 6, 2011

ariya mentioned this pull request Mar 15, 2013

phantomjs fails to load any page that does not set a Content-Type header #10128

Closed

This pull request was closed.
