-
Notifications
You must be signed in to change notification settings - Fork 5.8k
Truly fix #128: Pages with no 'Content-Type' header fail #135
Conversation
Concept by Roejames12 C++ patch by Niek (with edits by Roejames12)
Hold off on merging for now. Need to check out an issue; the current way we have causes parent frames HTML content to be overwritten with child frames, when a child frame has no 'Content-Type' header. |
Concepts and ideas by Niek and I C++ patch by Niek
It is now completely fixed, tested, and working. :) |
I'm still unsure about frame.setHtml(data, reply.url()) Why is it necessary to reset the frame content? |
Using this script: var page = new WebPage(),
page2 = new WebPage();
// this URL has no Content-Type
page.open('http://dubbelboer.com:9090/', function(status) {
console.log('No Content-Type returns ' + status + ' with the following content:');
console.log(page.content);
});
// this URL has a Content-Type
page2.open('http://dubbelboer.com:9090/html', function(status) {
console.log('Content-Type returns ' + status + ' with the following content:');
console.log(page2.content);
}); We will get these results with frame.setHtml:
Now, removing and replacing the setHtml call with a filler
So you can see, that unless we get the replies data, and use the From the Qt Documentation: unsupportedContent |
What happen if the "unsupported" content is not HTML/XML? |
I actually tested that out just now. What happens is that the reply starts getting downloaded (and eventually we will call setHtml on the data!). So right now I'm looking for a way to detect somehow if the content is a HTML document vs not, so we can ignore it. I figure this can be a good time to also implement a way to download files. A signal such as We don't have to do that part, but it's best we do that along with this, to complete everything; but if not, still need a way to reject everything that isn't a HTML document. Edit: I've updated issue 128 with more specifics. Also know that to detect what document mime type we have we can use readyRead() |
You should be able to do some content sniffing inside the downloadProgress signal, aborting the reply once you haven't found any html tags in the first n bytes. I tried writing it myself but I ended up with some strange infinite request loop. I added an iframe without any html tags to http://dubbelboer.com:9090/ to test it. |
Preliminary MIME sniffing is now complete in PyPhantomJS. Check issue 128 for more. |
Python implementation is ready. I'm just waiting on the C++ one at the moment (I can imagine it will take awhile, as this code is complex). Default types allowed through are: text, html, xml, and images. Everything else is aborted and sending a failed signal to the page. |
I have fixed everything in C except for the whole mime sniffer. Found some bugs and made some improvements on it which will need to be done to the Python implementation as well. As you can see I have modified http://dubbelboer.com:9090 for some extra testing. I'll upload a patch tomorrow. |
Ya, pretty cool. My first test on the new page passed flawlessly. :)
Sweet, can't wait to see! :) |
Actually it's requesting all the content 2 or 3 times because the body of the main frame gets reset every time a request finishes. You can fix this by removing the reply from m_replies after the setContent. I noticed this in the C version as well and have fixed it. I just finished the mime sniffer in C, need to fix one thing and then I'll send a patch tomorrow. By the way I also noticed that images in img tags without content-type headers don't fire unsupportedContent but just work. So they best way to test the sniffer for images is to load http://dubbelboer.com:9094/ only. |
I modified the test page again (try looking at it in a normal browser first). I don't think only looking at the first readReady will always work since it might not contain enough bytes to sniff. I already fixed this in the C version by waiting for 512 bytes or the end of the request (if it's less then 512). |
Nice catch. I also overlooked that readyRead is fired more than once, so we need to cancel subsequent calls too it somehow (you may have already done that).
Try for 1024 bytes to be more lenient. I know that 512 bytes works, but on some very rare cases (sniffing for binary content) we needed 1024 bytes to catch it. I already use 1024 bytes in the MimeSniffer, but as you said, waiting for the right amount or end of request is much more optimal. Looking forward to seeing your work. :) |
Here are my diffs and files: I have also added SWF files to the mime sniffer. I think 512 bytes should be enough seeing that other browsers seem to use 512 as well (except for IE, who only does 256). I think the code I have written could probably be improved by moving a lot of it to networkreplyproxy, that way you don't have to loop over all the replies all the time. Another way to improve it might be to use sender() to get the reply that send the signal. This is my first time working with Qt so I'm not sure how to do this and I hope someone else will. |
Looks good. I'll try to look it all over when I have more time, and give you pointers if I see anything. |
Shall I merge this or are you going to create another pull request with the complete changes? |
I'm going to continue to add commits until I'm done, then you can merge it. I still have the changes to do, and also possibly changing a few things. (Unless you want me to close this pull request, and squash all the commits) Edit: You know, squashing everything together would be far better anyways. I'll close this, finish my work, squash it all, then open a new request. Any more discussion can take place on issue 128. |
These patches have been tested and verified to work with unsupported content. :)
Please check the changes and let me know if anything needs to be fixed/altered.