
Linkedin, Wayback Machine, and W3C Validator issues - no indexing #1639

Closed
glasswork opened this issue Sep 5, 2017 · 28 comments

Comments

@glasswork

glasswork commented Sep 5, 2017

I am having the same issue as referenced in the Grav Community Forum archive:
https://discourse.getgrav.org/t/linkedin-url-sharing-not-working/1818

So that you need not refer to it, here is the content of that post:


We are using OG metadata on our pages for social sharing, and Facebook and Twitter are working OK. We have a problem with LinkedIn sharing - it does not pull content from the defined metadata or from the page itself.

We checked a combination of metadata by adding ?1 (?2, etc.) to the end of the URL to avoid a LinkedIn cache issue, but there was no progress. Then we saved one of the pages' source code as an HTML file and put it back on the server - and it worked! LinkedIn sharing was reading the OG metadata and it was OK (as it was for the other social networks).

Does anyone have any idea what we should do to enable LinkedIn URL sharing with Grav CMS?

Best regards,

Vladimir


Like Vladimir (above), I can't share pages on LinkedIn. I know it is not an Apache conf issue, an .htaccess issue, or a bad-code issue, because I can create a static version of the page and it shares just fine.

I do not have caching or compression enabled.

To test, I can type the following URLs into LinkedIn and see whether a page preview appears:

  • /news (does not have preview on LinkedIn)
  • /new.html (does have preview on LinkedIn)

The first URL is Grav-controlled; the second is a static HTML reproduction of the news page from Grav, placed at the site root.
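A quick way to compare what a crawler sees for the two URLs is to fetch just the headers with a crawler-like user agent. A sketch (example.com stands in for the real domain):

  # HEAD requests with a LinkedIn-style user agent
  curl -I -A "LinkedInBot/1.0" https://example.com/news
  curl -I -A "LinkedInBot/1.0" https://example.com/new.html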

The server is a typical Ubuntu LAMP-style setup (without the MySQL) hosted on AWS. I regularly check permissions and make sure these bases are covered:

# Per the Grav docs: files owned by the deploy user, group-owned by the web server
sudo chown -R ubuntu:www-data .
# Files group-writable; bin scripts and all directories executable; setgid on dirs
find . -type f | xargs chmod 664
find ./bin -type f | xargs chmod 775
find . -type d | xargs chmod 775
find . -type d | xargs chmod +s

Any help appreciated.

Thanks!

@glasswork
Author

Oh, and this is also true for the Wayback Machine ... it refuses to index Grav-controlled pages but indexes the static versions of pages just fine.

@glasswork glasswork changed the title Linkedin sharing issue - no preview Linkedin and Wayback Machine issues - no indexing Sep 5, 2017
@rhukster
Member

rhukster commented Sep 5, 2017

Quite odd, I will investigate this!

@rhukster rhukster added the bug label Sep 5, 2017
@rhukster
Member

rhukster commented Sep 6, 2017

I'm pretty sure this is related to LinkedIn caching. For example, the getgrav.org site does show a preview when shared on LinkedIn, but my local machine's will not. Everything I've found indicates that LinkedIn caches a share and keeps that cache for 7 days, so whatever you do during that window to resolve an issue, it won't update.

It doesn't seem that LinkedIn has an OpenGraph validator, but according to this, the page is good: http://opengraphcheck.com/result.php?url=http%3A%2F%2Fideum.com%2Fnews#.Wa9MldOGPUY

I'm not sure what it could be. Maybe try putting prefix="og: http://ogp.me/ns#" in your <html> tag? Although this might not take effect for a week :(
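That is, something like this (a minimal sketch):

  <html prefix="og: http://ogp.me/ns#">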

@glasswork
Author

glasswork commented Sep 6, 2017

I too looked up LinkedIn caching; before posting about this I had already waited the 7 days, so I just don't see how that can be the issue. I also tried the recommended addition of a URL param, but that didn't work.

Further Testing
I made a brand-new folder at the Grav site root called /newtons and put a file in it called index.html. The index is a slightly modified version of the 'new.html' page I tested previously: it has different OGP data and some small differences in the page text. (It was a new static version of the newly updated 'news' page.)

The Wayback Machine instantly accepted it as did Linkedin.

So that page, which is outside of Grav's control yet nearly identical to an existing Grav-controlled page (news), was indexed by both services.

I tested numerous Grav-controlled pages whose slugs we had adjusted (so as far as LinkedIn was concerned they were effectively new pages), and still no joy. So, again, I don't quite see how LinkedIn caching is the issue.

I do really appreciate your looking into this. It is indeed very odd.

@glasswork
Author

Oh, I always test my OGP metadata via Facebook's debugger ;)

@rhukster
Member

rhukster commented Sep 6, 2017

Is it possible that you have some security rule, .htaccess rule, or robots.txt directive in place that could be stopping it? I'm not able to test your site via https://validator.w3.org - I get an IO error for some reason.

@rhukster
Member

rhukster commented Sep 6, 2017

FYI:

Oh boy – it was my error, of course. In my user-agent blocking code in .htaccess I intended to block a user agent that started with "validator" and continued on. But I mistakenly wrote this, which blocked the W3C validator site!
SetEnvIfNoCase User-Agent ^Validator bad_bot
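A corrected rule would match the full bad-bot token rather than the bare prefix. A sketch for Apache 2.4, with a hypothetical bot name:

  # Block only the specific bot; bare "^Validator" also catches the W3C validator
  SetEnvIfNoCase User-Agent "^ValidatorBadBot" bad_bot
  <RequireAll>
      Require all granted
      Require not env bad_bot
  </RequireAll>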

@glasswork
Author

glasswork commented Sep 6, 2017

I don't have any user-agent blocking in my .htaccess, nor do I have any denials in my robots.txt or Apache conf files.

I have run files and images through Google's robots.txt checker, and all are allowed.

I have set the permissions exactly as they appear in the Grav docs (see my original post).

I will, in good faith, walk line by line through the .htaccess and see if I can turn up anything, but remember - all the faux files and folders I created would fall under the same .htaccess prohibitions, and I have no Grav-specific prohibitions anywhere.

It is a real head-scratcher.

@rhukster
Member

rhukster commented Sep 6, 2017

I do feel like it has something to do with the fact that the W3C checker can't reach your site. When that is working, I think LinkedIn will start working too.

@rhukster
Member

rhukster commented Sep 6, 2017

Also, you might want to check with your hosting provider, as .htaccess rules can be in place in their webserver setup and merely 'extended' by your local .htaccess.

@glasswork
Author

glasswork commented Sep 6, 2017

Agreed.

But with regard to the second suggestion: we run a clean AWS instance that we create, spin up, and provision ourselves - no web-host nonsense ;)

@glasswork
Author

I tested in Screaming Frog and I got a 200 response code for all my pages and images.

@glasswork
Author

glasswork commented Sep 7, 2017

It is definitely not the LinkedIn cache - I made a new A record (meaning a new, never-shared URL), pointed it at a copy of the Grav site with a clean .htaccess file, changed some slugs, and still nothing.

I am giving up and will make a static version of the site with the original .htaccess I was using for the Grav version. If that works, I will remove Grav from the production server and just use it locally, or jettison Grav altogether - though that defeats the purpose of having a CMS, which was for others to be able to edit online.

Thanks for your help.

@rhukster
Member

rhukster commented Sep 7, 2017

I still don't think it's actually a Grav issue itself. getgrav.org runs on an Ubuntu 14.04 Linode VPS with serverpilot.io managing nginx/apache/php, and LinkedIn is picking up the standard OG metadata we added with Grav's built-in metadata support.

Maybe it's a plugin that's affecting things, but it's more likely some server configuration that is blocking both the W3C validator and LinkedIn. LinkedIn is definitely pickier than Facebook/Twitter, but still, it should work as it does with getgrav.org. Even with no OG meta tags at all, LinkedIn should be able to give a basic preview of the site; it's as if it can't even reach it, or is blocked.

@glasswork
Author

glasswork commented Sep 7, 2017

I am not certain it is Grav itself either, but if not, it is almost certainly an interaction between something specific to Grav and a very vanilla Ubuntu server setup. I just tested a clean WordPress install and a clean Grav install: the former worked fine, the latter did not.

So it may not be Grav itself, but Grav is somehow part of the issue. I (and probably you) don't have time to hunt down the precise Grav-server interaction that is causing the problem. : )

If I do find the issue I will let you know.

@glasswork
Author

glasswork commented Sep 8, 2017

The Blackhole plugin would not create a full static site (it stopped at about 10 pages). So ...

The non-indexing issue seems to be an interaction between Grav and AWS servers:

  • A clean clone of the basic Grav site (right from your GitHub repo) had the same issue on several AWS servers (all Ubuntu setups).
  • A clean install on a hosted service worked.
  • Static pages at the Grav site root work on all servers.
  • Non-Grav sites (static, and dynamic with various CMS flavors) work on all servers.

Still trying to track down the precise interaction creating the issue.

No need to respond unless you have a suggestion. I am just trying to record all my tests here.

@glasswork
Author

Disabled all plugins ... no joy.

@glasswork
Author

glasswork commented Sep 11, 2017

Created a new instance of the site on a new server and pointed a subdomain at it.
The validator can view the home page's .md file if I provide a direct path to the file (no IO Error).
If I create a static version of that page (home) from the compiled code, name it index.html, put it in user/pages/01.home, and give the validator the URL (or the path - it doesn't matter), it is also indexed (no IO Error).

@glasswork
Author

glasswork commented Sep 11, 2017

And I can reach '/user/pages/01.home' by providing that path (no IO Error), but '/user/pages/home' throws the IO Error, as do '/home' and '/'.

@glasswork
Author

So, after extensive testing: every folder and file is reachable (no IO Error) except when the URL is used rather than the server path - that is, the moment '/user/pages/01.home' becomes '/home' or '/'. Yet every other site on that testing server works via URL or server path.

This seems related to: https://discourse.getgrav.org/t/problem-with-https-validator-w3-org-nu/4176

@rhukster
Member

Could you try disabling the shutdown: close_connection setting in your user/config/system.yaml?

https://learn.getgrav.org/basics/grav-configuration#debugger
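For reference, a sketch of that setting, assuming it sits under the debugger section as in the stock system.yaml:

  debugger:
    shutdown:
      close_connection: false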

@glasswork
Author

glasswork commented Sep 12, 2017

I have solved the issue, but I still don't quite get the why of it, or its scope.
Anyway: one MUST have Gzip compression enabled.
Although I didn't find anything in the response headers indicating that the server was serving files compressed, that nevertheless seems to have been the issue.
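A quick way to check whether responses actually come back compressed (a sketch; example.com is a placeholder):

  # Request gzip and look for a Content-Encoding header in the response
  curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" https://example.com/ | grep -i content-encoding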
What I wonder is whether this is limited to certain AWS servers or has wider scope.

@rhukster
Member

It would be good to know if it's related to close_connection; it might be. So please try that too. Also, please try this setting while keeping gzip: false:

  allow_webserver_gzip: true

@glasswork
Author

Yes, if I understood what you wanted of me.
I could use either webserver Gzip or Grav's Gzip compression, or both, but at least one had to be enabled.
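For anyone landing here later, the two settings in user/config/system.yaml (a sketch, assuming both sit under cache as in the stock config):

  cache:
    gzip: true                    # Grav compresses its own output
    allow_webserver_gzip: false   # or set this true and let the webserver compress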

@glasswork
Author

glasswork commented Sep 12, 2017

FYI - Grav threw an error when I turned on Gzip compression, but there was still a small save button in the upper-right corner, so I clicked through the error, and after that all was well.

Update - Fix is on production and working great. W3C Validator, LinkedIn and Wayback Machine are all happy.

@glasswork
Author

(screenshot: screen-shot-2017-09-12-at-9.33.00-am)

This is the error...

@glasswork glasswork changed the title Linkedin and Wayback Machine issues - no indexing Linkedin, Wayback Machine, and W3C Validator issues - no indexing Sep 12, 2017
@glasswork
Author

glasswork commented Sep 12, 2017

Thanks, @rhukster! That should save someone else a headache.

@rhukster
Member

I'm going to close this issue, but I have marked it for documentation.
