Skip to content
This repository has been archived by the owner on May 30, 2023. It is now read-only.

Truly fix #128: Pages with no 'Content-Type' header fail #135

Closed
wants to merge 4 commits into from
Closed

Truly fix #128: Pages with no 'Content-Type' header fail #135

wants to merge 4 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Aug 23, 2011

These patches have been tested and verified to work with unsupported content. :)

Please check the changes and let me know if anything needs to be fixed/altered.

Concept by Roejames12
C++ patch by Niek (with edits by Roejames12)
@ghost
Copy link
Author

ghost commented Aug 23, 2011

Hold off on merging for now. Need to check out an issue; the current way we have causes parent frames HTML content to be overwritten with child frames, when a child frame has no 'Content-Type' header.

Concepts and ideas by Niek and I
C++ patch by Niek
@ghost
Copy link
Author

ghost commented Aug 24, 2011

It is now completely fixed, tested, and working. :)

@ariya
Copy link
Owner

ariya commented Aug 25, 2011

I'm still unsure about

frame.setHtml(data, reply.url())

Why is it necessary to reset the frame content?

@ghost
Copy link
Author

ghost commented Aug 25, 2011

Using this script:

var page = new WebPage(),
    page2 = new WebPage();

// this URL has no Content-Type
page.open('http://dubbelboer.com:9090/', function(status) {
    console.log('No Content-Type returns ' + status + ' with the following content:');
    console.log(page.content);
});

// this URL has a Content-Type
page2.open('http://dubbelboer.com:9090/html', function(status) {
    console.log('Content-Type returns ' + status + ' with the following content:');
    console.log(page2.content);
});

We will get these results with frame.setHtml:

No Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>
Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>

Now, removing and replacing the setHtml call with a filler self.m_webPage.loadFinished.emit(True) will return:

No Content-Type returns success with the following content:
<html><head></head><body></body></html>
Content-Type returns success with the following content:
<html><head><title>test</title></head><body><img src="http://www.adperium.com/images/logo.png"></body></html>

So you can see, that unless we get the replies data, and use the setHtml call, the frame will have nothing in it except what a standard about:blank page will have. This makes complete sense, since I wouldn't except Qt to handle and set the frames Html to the replies data since it's unsupported content (it seems it might be possible that QtWebKit isn't doing autodetection of the type of document, so it doesn't know what to do when 'Content-Type' is missing).

From the Qt Documentation:

unsupportedContent
This signal is emitted when WebKit cannot handle a link the user navigated to ...
So that would assume that if Qt emitted this signal, it couldn't handle the content (hence unsupportedContent), so that's why the signal is being emitted. So if it's unsupported, it shouldn't be setting the frames HTML content, otherwise we could end up with nasty stuff in the webpage (think a binary file)!

@ariya
Copy link
Owner

ariya commented Aug 25, 2011

What happen if the "unsupported" content is not HTML/XML?

@ghost
Copy link
Author

ghost commented Aug 25, 2011

I actually tested that out just now. What happens is that the reply starts getting downloaded (and eventually we will call setHtml on the data!).

So right now I'm looking for a way to detect somehow if the content is a HTML document vs not, so we can ignore it. I figure this can be a good time to also implement a way to download files.

A signal such as page.onDownloadRequest, which has a callback where you can open a file (write+binary is best of course) and save the data to a file. It also needs a way to reject/approve the file download (it would need to also pause the reply until the user has indicated what to do).

We don't have to do that part, but it's best we do that along with this, to complete everything; but if not, still need a way to reject everything that isn't a HTML document.

Edit: I've updated issue 128 with more specifics. Also know that to detect what document mime type we have we can use readyRead()

@erikdubbelboer
Copy link
Contributor

You should be able to do some content sniffing inside the downloadProgress signal, aborting the reply once you haven't found any html tags in the first n bytes.

I tried writing it myself but I ended up with some strange infinite request loop.

I added an iframe without any html tags to http://dubbelboer.com:9090/ to test it.

@ghost
Copy link
Author

ghost commented Aug 31, 2011

Preliminary MIME sniffing is now complete in PyPhantomJS. Check issue 128 for more.

@ghost
Copy link
Author

ghost commented Sep 4, 2011

Python implementation is ready. I'm just waiting on the C++ one at the moment (I can imagine it will take awhile, as this code is complex). Default types allowed through are: text, html, xml, and images. Everything else is aborted and sending a failed signal to the page.

@erikdubbelboer
Copy link
Contributor

I have fixed everything in C except for the whole mime sniffer. Found some bugs and made some improvements on it which will need to be done to the Python implementation as well.

As you can see I have modified http://dubbelboer.com:9090 for some extra testing.

I'll upload a patch tomorrow.

@ghost
Copy link
Author

ghost commented Sep 4, 2011

As you can see I have modified http://dubbelboer.com:9090 for some extra testing.

Ya, pretty cool. My first test on the new page passed flawlessly. :)

Found some bugs and made some improvements on it which will need to be done to the Python implementation as well.

Sweet, can't wait to see! :)

@erikdubbelboer
Copy link
Contributor

Ya, pretty cool. My first test on the new page passed flawlessly. :)

Actually it's requesting all the content 2 or 3 times because the body of the main frame gets reset every time a request finishes. You can fix this by removing the reply from m_replies after the setContent. I noticed this in the C version as well and have fixed it.

I just finished the mime sniffer in C, need to fix one thing and then I'll send a patch tomorrow.

By the way I also noticed that images in img tags without content-type headers don't fire unsupportedContent but just work. So they best way to test the sniffer for images is to load http://dubbelboer.com:9094/ only.

@erikdubbelboer
Copy link
Contributor

I modified the test page again (try looking at it in a normal browser first). I don't think only looking at the first readReady will always work since it might not contain enough bytes to sniff. I already fixed this in the C version by waiting for 512 bytes or the end of the request (if it's less then 512).

@ghost
Copy link
Author

ghost commented Sep 5, 2011

Actually it's requesting all the content 2 or 3 times because the body of the main frame gets reset every time a request finishes. You can fix this by removing the reply from m_replies after the setContent. I noticed this in the C version as well and have fixed it.

Nice catch. I also overlooked that readyRead is fired more than once, so we need to cancel subsequent calls too it somehow (you may have already done that).

I don't think only looking at the first readReady will always work since it might not contain enough bytes to sniff. I already fixed this in the C version by waiting for 512 bytes or the end of the request (if it's less then 512).

Try for 1024 bytes to be more lenient. I know that 512 bytes works, but on some very rare cases (sniffing for binary content) we needed 1024 bytes to catch it. I already use 1024 bytes in the MimeSniffer, but as you said, waiting for the right amount or end of request is much more optimal. Looking forward to seeing your work. :)

@erikdubbelboer
Copy link
Contributor

Here are my diffs and files:
http://dubbelboer.com/phantomjs/mimesniffer.cpp
http://dubbelboer.com/phantomjs/mimesniffer.h
http://dubbelboer.com/phantomjs/networkreplyproxy.cpp.diff
http://dubbelboer.com/phantomjs/networkreplyproxy.h.diff
http://dubbelboer.com/phantomjs/phantomjs.pro.diff
http://dubbelboer.com/phantomjs/webpage.cpp.diff
http://dubbelboer.com/phantomjs/webpage.h.diff

I have also added SWF files to the mime sniffer.

I think 512 bytes should be enough seeing that other browsers seem to use 512 as well (except for IE, who only does 256).

I think the code I have written could probably be improved by moving a lot of it to networkreplyproxy, that way you don't have to loop over all the replies all the time. Another way to improve it might be to use sender() to get the reply that send the signal. This is my first time working with Qt so I'm not sure how to do this and I hope someone else will.

@ghost
Copy link
Author

ghost commented Sep 5, 2011

Looks good. I'll try to look it all over when I have more time, and give you pointers if I see anything.

@ariya
Copy link
Owner

ariya commented Sep 6, 2011

Shall I merge this or are you going to create another pull request with the complete changes?

@ghost
Copy link
Author

ghost commented Sep 6, 2011

I'm going to continue to add commits until I'm done, then you can merge it. I still have the changes to do, and also possibly changing a few things. (Unless you want me to close this pull request, and squash all the commits)

Edit: You know, squashing everything together would be far better anyways. I'll close this, finish my work, squash it all, then open a new request. Any more discussion can take place on issue 128.

This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants