gizmodo/lifehacker don't work (they store text in JSON-LD now) #898

Closed
vshabanov opened this issue Aug 9, 2021 · 3 comments

Comments

@vshabanov
Contributor

vshabanov commented Aug 9, 2021

Hi, both gizmodo and lifehacker now store article text in JSON-LD:

<script type="application/ld+json">...</script>

Could you please add support for JSON-LD documents?
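For illustration, here is a minimal Python sketch of what reading article text out of a JSON-LD block could look like. It is not how Full-Text RSS works (that project is PHP). The sample HTML and the helper names `JsonLdExtractor` and `extract_article_bodies` are made up for this example, and it assumes the text sits in the schema.org `articleBody` property of a top-level object.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_article_bodies(html):
    """Return the articleBody strings found in top-level JSON-LD objects."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    bodies = []
    for doc in parser.documents:
        # A JSON-LD block may hold a single object or a list of objects.
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and "articleBody" in item:
                bodies.append(item["articleBody"])
    return bodies


# Hypothetical page, not taken from gizmodo or lifehacker.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Example headline", "articleBody": "Plain text of the article."}
</script>
</head><body></body></html>"""

print(extract_article_bodies(SAMPLE_HTML))
```

The sketch ignores nested structures such as `@graph`, so a real page may need more handling than this.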

@fivefilters
Owner

fivefilters commented Aug 9, 2021

Hi Vladimir, I think the pages still contain the content in the HTML, but for some reason the HTML5 parser doesn't seem to be able to parse it properly.

In my tests, switching to the libxml parser does appear to work. Are you able to try the updated site config for gizmodo and see if you have any luck? https://github.com/fivefilters/ftr-site-config/blob/master/gizmodo.com.txt

If that doesn't work, can you please give us a URL so we can test with it and see if there's another solution?

When I looked, the content in JSON-LD did not contain any HTML markup, and at the moment the Full-Text RSS code only uses JSON-LD for other metadata, not content.

@vshabanov
Contributor Author

Oh, that's great. I tried searching for some of the text, but they randomly insert <!-- --> comments into the HTML, so I didn't find it.

Both Gizmodo and Lifehacker are working now. Thank you!
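As an aside on the search problem mentioned above, here is a small Python sketch of checking raw HTML for a phrase while ignoring interspersed comments. The sample string and the helper names `strip_comments` and `contains_text` are hypothetical, and the regex is a simplification that does not account for comment-like text inside scripts or attributes.

```python
import re

# Non-greedy match so each comment is removed individually;
# DOTALL lets a comment span multiple lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Remove HTML comments from a string of markup."""
    return HTML_COMMENT.sub("", html)


def contains_text(html, needle):
    """Check for a phrase after comments have been stripped out."""
    return needle in strip_comments(html)


# Hypothetical markup imitating comments inserted between words.
raw = "<p>The quick <!-- -->brown<!-- --> fox</p>"

print("quick brown fox" in raw)               # False: comments break up the phrase
print(contains_text(raw, "quick brown fox"))  # True once comments are removed
```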

@fivefilters
Owner

fivefilters commented Aug 11, 2021

It's pretty strange. The HTML they serve is a mess. If you disable JavaScript in the browser, nothing loads (just a blank screen), and Firefox's developer tools don't seem to be able to parse the HTML without JavaScript for some reason, so the DOM tree Firefox produces is very minimal compared to the actual source HTML returned by the server. I think the same thing happens when Full-Text RSS tries to use the HTML5 parser (HTML5PHP), although I haven't tested extensively. But the HTML does contain the content, and libxml can parse it, so the main change in the site config file that makes this work again is: parser: libxml. Very odd.
