Search Engine Accessibility #6402

Open
joysfera opened this issue Jan 6, 2019 · 18 comments

@joysfera

joysfera commented Jan 6, 2019

Expected behavior

Google, Bing and others can index each public post in Friendica and later offer them in their search indexes by direct URL to the given post.

Actual behavior

It seems to me that Google always returns the /search page of Friendica.

I am very eager to get my posts and posts in our forums indexed by search engines. What can I do for that, please?

Steps to reproduce the problem

  1. pick one of your posts with unique keywords that has been online long enough to get indexed
  2. use google.com or bing.com and enter the keywords there, plus the site:domain filter (not sure if bing has it)

For comparison, Google+ (yes, I'm coming from dying Google+) posts are indexed properly, so if you search for say my name "Petr Stehlík" and some keywords you'll get the unique direct URLs to the posts in the form of https://plus.google.com/+PetrStehl%C3%ADk/posts/uniqueID

Friendica version you encountered the problem

2019.01rc

Friendica source (git, zip)

as currently on nerdica.net

PHP version

as currently on nerdica.net

SQL version

as currently on nerdica.net

@MrPetovan
Collaborator

Yeah, currently there isn't any sensible SEO in Friendica, and the internal search itself is a mess, sorry about that.

@joysfera
Author

joysfera commented Jan 6, 2019

Any idea how to improve it? For example in Google+ each of the messages in the stream (that I suppose is what the crawler gets to see) contains the following snippet:

<a href="./+PetrStehlík/posts/G1oSQWvHhTH" class="eZ8gzf" jsaction="click:WZfesd(preventDefault=true);" jsname="hJglhd" jslog="14487; track:click" aria-label="Full post view"><span class="DPvwYc rRPL7d" aria-hidden="true"></span></a>

In Friendica the full post view direct URL link is hidden in a popup window (in Frio theme, at least) and looks as follows:

<li role="menuitem"><a title="odkaz na zdroj" href="redir/123093?url=display/a85d7459-115c-3262-e803-020860534242" class="navicon plink u-url"><i class="fa fa-external-link" aria-hidden="true"></i> odkaz na zdroj</a></li>

I have no idea if pulling the direct link out of the popup menu would help the crawlers. Actually I don't even know what the crawlers see so it's hard for me to suggest what to improve to help them index it better.

Any idea, please?

@MrPetovan
Collaborator

In a popup window? This behavior doesn't sound familiar. And the redir only appears because you're logged in. Please try the search in a private browsing window without logging in.

Otherwise, there is a host of HTML metadata that we could provide to enable search engine crawlers, including sitemaps, page info, etc... but someone™ has to do it.
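
To illustrate the sitemap idea, here is a minimal Python sketch (Friendica itself is PHP, so this is not project code) that builds a sitemap in the sitemaps.org XML format from a list of public post GUIDs. The `build_sitemap` helper is hypothetical, and so is the assumption that every public post is reachable at `/display/<guid>`. How Friendica would enumerate its public posts is left open.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, guids):
    """Return a sitemap XML string with one /display/<guid> entry per post.

    Hypothetical helper: assumes each public post is reachable at
    base_url + "/display/" + guid.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for guid in guids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{base_url.rstrip('/')}/display/{guid}"
    body = ET.tostring(urlset, encoding="unicode")
    # Prepend the XML declaration by hand to keep the output predictable.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap("https://nerdica.net",
                        ["a85d7459-105c-3245-8fef-ffb119770705"]))
```

A crawler could then be pointed at the generated file through a `Sitemap:` line in robots.txt.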

@joysfera
Author

joysfera commented Jan 6, 2019

In a private window it's the same:

<li class="dropdown open">
				<button type="button" class="btn-link dropdown-toggle" data-toggle="dropdown" id="dropdownMenuTools-4371740" aria-haspopup="true" aria-expanded="true"><i class="fa fa-angle-down" aria-hidden="true"></i></button>
				<ul class="dropdown-menu pull-right" role="menu" aria-labelledby="dropdownMenuTools-4371740">
    				<li role="menuitem">
						<a title="odkaz na zdroj" href="https://nerdica.net/display/a85d7459-105c-3245-8fef-ffb119770705" class="navicon plink u-url"><i class="fa fa-external-link" aria-hidden="true"></i> odkaz na zdroj</a>

This is the most crucial thing for me. I can work around bugs in navigation, can remember not to post what I cannot delete, but I selected Friendica over say MeWe.com as I believed my posts would be searchable and search engines would index it for me. I'd very much need this fixed. What can I do in order to help improve the SEO stuff, please? Just let the search engines index the posts, I don't need it super efficient and win some keyword war, nothing like that. Just to let them understand that these are single posts available under given URL.

@MrPetovan
Collaborator

The snippet you copied is the top-left dropdown menu showing the original URL of the displayed post, which may be an external link if it was posted on a remote server first. Not sure if it has any impact on SEO though.

I understand your concern, but we don't have a resident SEO expert in the Friendica team, so anything we do will involve a lot of learning friction, and most of us would rather work on other stuff because it's more convenient.

@joysfera
Author

joysfera commented Jan 6, 2019

Definitely work on more convenient things. Development should be, and needs to be, fun.
I'll look into the SEO stuff myself. Not only am I not an SEO expert in any way, I am also totally unfamiliar with Friendica internals or anything even remotely related to them. So if you have any useful hints, please share them. Thanks.

@joysfera
Author

joysfera commented Jan 6, 2019

As for the HTML snippet: it contains the very URL the crawler needs to see and remember, which is why I searched for it on the page and posted it here. Maybe it was a nonsense idea in the first place, I don't know.

@MrPetovan
Collaborator

Changing HTML templates is pretty straightforward. If you have specific improvements to suggest, I'd be happy to implement them.

@joysfera
Author

joysfera commented Jan 7, 2019

A good start is here: https://search.google.com/search-console/welcome
I'm just afraid that I'd need to run my own instance of Friendica first in order to prove to Google that I "own" the site, so first things first...

@MrPetovan
Collaborator

If you're ready to do it, we can certainly help with that.

@annando
Collaborator

annando commented Jan 7, 2019

Yeah, we are really happy with every person who contributes stuff!

@tobiasd
Collaborator

tobiasd commented Jan 8, 2019

Keep in mind those who do not want their posts to be easily searchable, and make any SEO measures optional.

@MrPetovan
Collaborator

This is nonsense. Either these people should make all their posts private or have a conservative robots.txt file. Everyone else should have their public posts correctly crawled by search engines.

@annando
Collaborator

annando commented Jan 10, 2019

We already have a default robots.txt mechanism (/mod/robots_txt). I suggest making it configurable so that crawlers are allowed to index the profiles and the local community, but nothing more. No search, no global community, no other page.

The other option should be a "Leave me alone" setting.

AFAIK all SEO improvements depend on the robots.txt settings, so improving the SEO stuff should be no problem at all.
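
As a rough illustration of such a two-way setting, the Python sketch below generates one of two robots.txt variants and checks the result with the standard library's `urllib.robotparser`. Everything named here is an assumption for illustration: the `build_robots_txt` and `is_allowed` helpers, the boolean setting, and the paths `/profile/` and `/community/local`, which may not match the real Friendica routes. It is not how /mod/robots_txt works today.

```python
from urllib import robotparser

# Hypothetical paths; the real Friendica routes may differ.
INDEXABLE_PATHS = ["/profile/", "/community/local"]


def build_robots_txt(allow_indexing):
    """Return robots.txt text for the "index me" or "leave me alone" setting."""
    lines = ["User-agent: *"]
    if allow_indexing:
        # Allow lines come first so that first-match parsers honour them too.
        lines += [f"Allow: {path}" for path in INDEXABLE_PATHS]
    # Everything not explicitly allowed above stays off limits.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, path):
    """Check a path against robots.txt text using Python's own parser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(build_robots_txt(allow_indexing=True))
```

Python's parser is not the one any search engine uses, so `is_allowed` is only a sanity check on the generated text.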

@joysfera
Author

joysfera commented Jul 28, 2019

So here is what I have found in the meantime: it turned out that Friendica itself is OK. If you try searching for, say, "Petr Stehlík ploché konektory" or "Petr Stehlík ESP8266 z bláta do louže", you'll find perfectly indexed posts under the URL domain/display/MESSAGE_ID (the former on the www.friendica.cz domain, the latter on the www.libranet.de domain), and it works just great.
Unfortunately, when leaving Google+ I chose the www.nerdica.net hosting, and there the web crawler indexing DOES NOT WORK: it either remembers the profile page URL (which is unusable because it's a stream of new posts, so you don't find the one you searched for) or it even indexes a date-based page (such as https://nerdica.net/profile/joy/2019-02-28/2019-02-01?page=0 ), but that is off, too.

It seems to be a configuration issue, right? Any idea what to search for? What could I ask the admin of www.nerdica.net to change or reconfigure, please?

@annando
Collaborator

annando commented Jul 28, 2019

This can be configured in the robots.txt file. See here for details: https://support.google.com/webmasters/answer/6062596

@joysfera
Author

joysfera commented Jul 28, 2019

Hm, for comparison - libranet.de (that is indexed properly):

User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
User-Agent: Googlebot-Mobile
Allow: /
User-Agent: Bingbot
Allow: /
User-Agent: DuckDuckBot
Allow: /
User-Agent: yacybot
Allow: /
User-Agent: Archive.org_bot
Allow: /

and nerdica.net (that seems to be indexed improperly yet still some pages are in search engines' archives):

User-agent: *
Disallow: /settings/
Disallow: /admin/
Disallow: /message/
Disallow: /search
Disallow: /help
Disallow: /proxy

So libranet.de invites a few good crawlers by "allowing" them to index everything, while nerdica.net lists a bunch of paths that are not to be indexed and doesn't say anything about the rest of the site.

Since full access is the default assumption and the explicit Allow can thus be omitted, libranet's inviting robots.txt should behave no differently for those crawlers than nerdica.net's. So if the cause of the bad indexing is indeed robots.txt, then one of the paths listed on nerdica.net as disallowed must be crucial for proper message-ID indexing. However, none of the paths listed above seems relevant to message indexing, to me anyway.

If you disagree and feel like one of the paths could be causing search engine's indexing issues please tell me.
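
One way to sanity-check this reasoning is to feed both files to Python's standard `urllib.robotparser` and ask whether a `/display/` URL may be fetched. This is only a sketch: the `can_fetch` wrapper is a helper written for this example, MESSAGE_ID is a placeholder, and Python's parser is not the one Googlebot uses, so the real crawlers may interpret edge cases differently.

```python
from urllib import robotparser

# robots.txt of libranet.de as quoted above.
LIBRANET = """\
User-Agent: *
Disallow: /
User-Agent: Googlebot
Allow: /
User-Agent: Googlebot-Mobile
Allow: /
User-Agent: Bingbot
Allow: /
User-Agent: DuckDuckBot
Allow: /
User-Agent: yacybot
Allow: /
User-Agent: Archive.org_bot
Allow: /
"""

# robots.txt of nerdica.net as quoted above.
NERDICA = """\
User-agent: *
Disallow: /settings/
Disallow: /admin/
Disallow: /message/
Disallow: /search
Disallow: /help
Disallow: /proxy
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if Python's parser lets user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("libranet.de", LIBRANET, "https://libranet.de/display/MESSAGE_ID"),
        ("nerdica.net", NERDICA, "https://nerdica.net/display/MESSAGE_ID"),
        ("nerdica.net", NERDICA, "https://nerdica.net/search"),
    ]
    for host, text, url in checks:
        print(host, url, can_fetch(text, "Googlebot", url))
```

If both `/display/` checks come back allowed, as they should under Python's parser, that would point away from robots.txt as the cause of the difference.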

@annando
Collaborator

annando commented Jul 28, 2019

I'm not a real expert in this stuff. And I must confess that I'm working more on the opposite, that is, rejecting access for search crawlers altogether. The background is that article 17 of the EU copyright directive has changed the responsibility for copyright violations. So we should do everything to avoid copyrighted material being shared, but also to make sure that it cannot be found.
