The Verge site config does not pull images #283

Closed
n00b12345 opened this issue Apr 1, 2017 · 14 comments

@n00b12345

n00b12345 commented Apr 1, 2017

I've played around with it a lot. It does not pull any images. This is happening with a lot of sites for me, but since The Verge's feed is easily testable, I'm reporting that as the test case.

@fivefilters
Owner

Could you provide an example? I just tried the first article that loaded and got this result (plenty of images):

http://ftr.fivefilters.org/makefulltextfeed.php?url=www.theverge.com%2F2017%2F4%2F1%2F14969400%2Fsci-fi-fantasy-books-recommendations-april-2017

@n00b12345
Author

I'll try to explain.

Even in the link you provided, if you visit The Verge page, you'll notice that the header image or the first image (whatever you want to call it) doesn't load. For a lot of pages, this is the only image in the entire post.

Take a look at this page. It loads no image, but The Verge website certainly has one:

http://ftr.fivefilters.org/makefulltextfeed.php?url=http%3A%2F%2Fwww.theverge.com%2Fcircuitbreaker%2F2017%2F3%2F31%2F15129708%2Fapple-usb-c-accessories-cables-adaptors-discount-macbook-pro

@fivefilters
Owner

Ah, I see what you mean. Yes, this is an issue which we hope to improve. The main problem is that these feature images are often outside the main body element. It's possible to include them with custom rules (I'll try to add one for theverge.com) but the ideal solution would be something a little smarter that can try to detect them.

A few versions ago we actually added code to Full-Text RSS that would look for the og:image meta element and insert that into the start of the extracted article if and only if the extracted HTML contained no image elements. I need to see why it's not working for the example you provided, as it should really be including this image.
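To make the fallback described above concrete, here is a minimal sketch of the idea in Python. It is not the Full-Text RSS code (which is PHP); the function and class names are hypothetical, and it only assumes the behaviour stated above: prepend the og:image to the extracted article if and only if the extracted HTML has no image elements.

```python
from html import escape
from html.parser import HTMLParser


class _Scan(HTMLParser):
    """Records whether an <img> was seen and the first og:image URL."""

    def __init__(self):
        super().__init__()
        self.has_img = False
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.has_img = True
        elif tag == "meta" and attrs.get("property") == "og:image":
            if self.og_image is None:
                self.og_image = attrs.get("content")


def scan(html):
    """Parse an HTML string and return the populated scanner."""
    parser = _Scan()
    parser.feed(html)
    parser.close()
    return parser


def add_og_image_fallback(page_html, extracted_html):
    """Hypothetical helper: prepend og:image only when no <img> was extracted."""
    if scan(extracted_html).has_img:
        return extracted_html  # article already has an image, leave it alone
    og_image = scan(page_html).og_image
    if not og_image:
        return extracted_html  # nothing to fall back to
    return '<img src="%s" />' % escape(og_image, quote=True) + extracted_html
```

Note the consequence of the "if and only if" condition: an article whose body contains any inline image never gets the og:image added, even if the hero image itself was missed.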

@n00b12345
Author

I've tried this using various XPath patterns, but the output remains the same. Another feed with a similar issue is nytimes.com.

For some weird reason, even when I select the topmost div element, the main image (placeholder image) is always skipped. Same with this.

@fivefilters
Owner

fivefilters commented Apr 1, 2017

Just updated the site config for The Verge, so this issue should be fixed for this site if you try the links above again.

This line in the site config

body: //div[contains(@class, 'c-entry-content') or contains(@class, 'c-entry-hero__image')]

was changed to

body: //picture[contains(@class, 'c-picture')] | //div[contains(@class, 'c-entry-content') or contains(@class, 'c-entry-hero__image')]
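As a rough illustration of what the added `//picture[...]` branch changes, the sketch below emulates both rules in plain Python against a small sample. The markup is hypothetical and simplified (the real theverge.com structure may differ), and the matching is hand-written because the standard library's ElementTree does not support `contains()` in XPath; it is not how Full-Text RSS evaluates site configs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the hero image sits outside the body div.
SAMPLE = """
<html><body>
  <picture class="c-picture"><img src="hero.jpg" /></picture>
  <div class="c-entry-content"><p>Article text.</p></div>
</body></html>
"""


def class_contains(el, needle):
    # Same substring semantics as XPath contains(@class, '...').
    return needle in el.get("class", "")


def old_rule(el):
    # //div[contains(@class, 'c-entry-content') or contains(@class, 'c-entry-hero__image')]
    return el.tag == "div" and (
        class_contains(el, "c-entry-content")
        or class_contains(el, "c-entry-hero__image")
    )


def new_rule(el):
    # //picture[contains(@class, 'c-picture')] | <old rule>
    return (el.tag == "picture" and class_contains(el, "c-picture")) or old_rule(el)


def select(root, rule):
    # iter() walks in document order, the order XPath engines commonly
    # return for a union.
    return [el for el in root.iter() if rule(el)]


root = ET.fromstring(SAMPLE)
old_tags = [el.tag for el in select(root, old_rule)]
new_tags = [el.tag for el in select(root, new_rule)]
print(old_tags)  # ['div']
print(new_tags)  # ['picture', 'div']
```

With this sample, the old rule selects only the content div, while the new rule also picks up the `picture` element that sits outside it.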

@n00b12345
Author

Thanks a lot. For some weird reason, my own installation (v3.5) doesn't show images even with the latest config update. Strange.

@n00b12345
Author

Could you test if this works with older versions? I tried your config with two older versions and it didn't work.

Thanks again. Appreciate the help.

@fivefilters
Owner

No time at the moment to test older versions, but I can't see why it'd be an issue. It might have something to do with the parser being used or the lazy image replacement. My suggestion is to enable debug on both our hosted version and your own version and compare the results.

@n00b12345
Author

Great suggestion. Appreciate all the help. @fivefilters

@n00b12345
Author

n00b12345 commented Apr 2, 2017

I solved it. Turning off the html5php parser resolved the issue. Must be something with my system. @fivefilters

@n00b12345
Author

n00b12345 commented Apr 2, 2017

I'll just put this link here. The same thing happens with NYTimes too: the header image is missing.

http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Fmobile.nytimes.com%2F2017%2F03%2F30%2Ftechnology%2Fuber-waymo-levandowski.html&max=1

@fivefilters
Owner

@n00b12345
Author

n00b12345 commented Apr 3, 2017

@n00b12345
Author

n00b12345 commented Apr 3, 2017

It does now, yay!

Actually sometimes it does, sometimes it doesn't. Weird.
