Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a custom schema for the search indexing meta data #1250

Merged
merged 1 commit into from Feb 17, 2020

Conversation

@Toflar
Copy link
Member

Toflar commented Jan 24, 2020

Draft PR to fix #1236.

@Toflar Toflar force-pushed the bugfix/invalid-search-jsonld-context branch from 29bc1bb to 77bbc83 Jan 24, 2020
Copy link
Contributor

aschempp left a comment

I'm very unsure whether it's a good idea to try to use schema.org for our internal data structure. The mapping is very limited and more "guessing" than actually knowing which meta properties we should use.
I would assume the issue with #1236 is not that we do use our own schema, but that the provided link does actually not point to a valid schema definition? So maybe we have to come up with JSON-LD schema information for our content? Which actually makes sense, so there's a "documentation" about the features?

$groups = array_map('intval', array_filter((array) $objPage->groups));
if (!empty($groups))
{
$audience[] = array
(
'@type' => 'PeopleAudience',
'name' => 'groups',
'identifier' => $groups,
);
}
Comment on lines 855 to 864

This comment has been minimized.

Copy link
@aschempp

aschempp Jan 24, 2020

Contributor

this should only be applied if the page is protected

if ($token !== null && $token->getUser() instanceof FrontendUser)
{
$meta['memberId'] = (int) $token->getUser()->id;
$audience[] = array
(
'@type' => 'PeopleAudience',
'name' => 'member',
'identifier' => (int) $token->getUser()->id,
);
}
Comment on lines 869 to 877

This comment has been minimized.

Copy link
@aschempp

aschempp Jan 24, 2020

Contributor

Why do we override the group permissions with user permissions? There can never be group permissions then, because there's always a user logged in when viewing that information? I know that has been the case before (and this PR is only to rewrite the existing), but maybe you can explain?

@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Jan 24, 2020

I agree that we should use our own schema, so that search engines do not try to process our JSON-LD data.

@Toflar

This comment has been minimized.

Copy link
Member Author

Toflar commented Feb 4, 2020

I don't think we should use our own schema. It's actually a big plus that other search engines can understand even more about our pages. Why wouldn't we want to have that? We already add quite a few schema.org tags using HTML, why not enhance it using JSON-LD?

@leofeyer leofeyer added the defect label Feb 4, 2020
@leofeyer leofeyer added this to the 4.9 milestone Feb 4, 2020
@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Feb 4, 2020

Good point. I guess as long as the schema validates, it does not hurt to use the existing one.

@aschempp

This comment has been minimized.

Copy link
Contributor

aschempp commented Feb 7, 2020

I've investigated the topic, and using our own schema is the only sensible approach to this.

Our own schema will allow us (someone reading Contao pages) to actually get meaningful values. Only if we use our own schema, a crawler can know with certainty that the current page is a Contao page. If it's a generic schema.org/WebPage, noone knows if version = preview is actually the preview version of a page (that must not be indexed). Another schema.org/WebPage could use preview for any other reason.

Using our own schema(s) actually gives much broader features: Our crawler could index the page, and it could (separately) index the news item. Search results could then actually show search results for news on the current website separately from the page search results. News can have a picture, and show up accordingly etc. all thanks to respective schema attributes.

To provide meaningful information to external crawlers, there are two options:

  1. Add additional schema information for schema.org/WebPage. Like generating similar JSON twice. That would be easy, and no problem for performance, as GZIP compresses duplicates anyway. I'm not sure if its' useful though, because we already have a lot of meta attributes that essentially generate the same information.
  2. Add a property to our own schema, that implements schema.org/WebPage. Our JSON would then basically contain a sub-key with the schema.org/WebPage stuff. I'm not sure if any parser (looking for a @type WebPage) would understand this, but it would certainly be valid JSON-LD.

At the end of the day, using schema.org/WebPage does not provide much benefit, because that information is already present on the page (using meta tags or itemscope/itemtype). Adding our own schema will allow our own crawler (or one understanding Contao) to read meaningful information.

@ausi

This comment has been minimized.

Copy link
Member

ausi commented Feb 7, 2020

I agree with @aschempp

I looked at the definitions of potentialAction, conditionsOfAccess and audience. I think they don’t fit our use case and would potentially break something if someone wants to use one of these properties for their intended use.

For example someone wants to add schema tags to the search form and adds the SearchAction as a potentialAction to the WebPage (see schema.org/SearchAction Example 3). This would result in a conflict with our data.

IMO we should reuse the existing data that is already there for external crawlers:

  • language => <html lang="…">
  • title => <title>…</title>

And put the data that can only be meaningful to an internal crawler seperatly:

  • noSearch => This would be a <meta content="noindex"> but AFAIK we do want to differentiate between public searchable and private searchable. So we cannot use the data for external crawlers here.
  • protected => There might be a standardized way to tell crawlers about protected pages, but conditionsOfAccess doesn’t seem right here IMO.
  • groups => internal
  • fePreview => internal
  • pageId => internal
  • memberId => internal

The internal data doesn’t have to be JSON-LD IMO, because it’s internal and cannot be used by someone without knowing about Contao.

@aschempp

This comment has been minimized.

Copy link
Contributor

aschempp commented Feb 10, 2020

The internal data doesn’t have to be JSON-LD IMO, because it’s internal and cannot be used by someone without knowing about Contao.

It should be JSON-LD, because Escargot / our crawler can understand it. That's the point of JSON-LD, to present machine-readable data to things crawling the page. But we need our own schema so one can know this is exactly what the data is. @type: http://schema.org/WebPage tells that name: foobar is the name of a webpage. Our type and props need to tell about our data.

@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Feb 13, 2020

As discussed in Mumble on February 13th, we want to have our own schema on schema.contao.org and retrieve the data as suggested by @ausi above.

@Toflar

This comment has been minimized.

Copy link
Member Author

Toflar commented Feb 13, 2020

After looking into this, our discussed solution doesn't work so I had to mark it up for discussion again.

It seems like https://search.google.com/structured-data/testing-tool is not really a proper JSON LD validator but rather just validates schema.org and Google specific JSON LD data.
You can validate that by just copying the very first example of https://json-ld.org/ into the testing tool. It will tell you that There was an invalid type in your JSON-LD. which is not true, it's a perfectly valid type, it's just one Google doesn't know and doesn't care about.

So it seems like we cannot and don't have to get rid of this error.

…lements for the rest
@Toflar Toflar force-pushed the bugfix/invalid-search-jsonld-context branch from 77bbc83 to acc3669 Feb 13, 2020
@Toflar Toflar marked this pull request as ready for review Feb 13, 2020
@Toflar

This comment has been minimized.

Copy link
Member Author

Toflar commented Feb 13, 2020

I've still updated the schema URL so that we could (or still can) introduce the correct schema and adjusted the current behaviour to read the language and title from HTML rather than the JSON-LD context as suggested by @ausi.

@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Feb 13, 2020

I also ran some tests and the following JSON+LD is valid according to the Google test tool:

{
  "@context": "https://schema.org",
  "@type": "Page",
  "pageId": 10,
  "noSearch": false,
  "protected": false,
  "groups": [],
  "fePreview": false
}

I'm pretty sure that this is just a coincidence, but as we will be using an "invalid" schema anyway, why not use one that does not trigger validation errors?

@Toflar

This comment has been minimized.

Copy link
Member Author

Toflar commented Feb 13, 2020

But it‘s incorrect. Just because it validates, doesn‘t mean it‘s correct. Imho what Google does is wrong.

@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Feb 14, 2020

Yes, I agree.

What about using good old <meta> tags instead of JSON-LD?

<meta name="page_id" content="10">
<meta name="no_search" content="">
<meta name="protected" content="">
<meta name="groups" content="">
<meta name="fe_preview" content="">

One advantage would be that we could entirely omit tags without content.

@Toflar

This comment has been minimized.

Copy link
Member Author

Toflar commented Feb 14, 2020

I don't understand what you are trying to achieve, really. JSON-LD was invented for exactly the use case we're having here. We don't need to bend over backwards just to please any tool that validates incorrectly.

@ausi
ausi approved these changes Feb 15, 2020
@leofeyer leofeyer merged commit b59bd53 into 4.9 Feb 17, 2020
9 checks passed
9 checks passed
Coverage
Details
Coding Style
Details
PHP 7.2
Details
PHP 7.3
Details
PHP 7.4
Details
Prefer Lowest
Details
Bundles
Details
Windows
Details
codecov/project Absolute coverage decreased by -0.31% but relative coverage increased by +9.84% compared to a5feefc
Details
@leofeyer leofeyer deleted the bugfix/invalid-search-jsonld-context branch Feb 17, 2020
@leofeyer

This comment has been minimized.

Copy link
Member

leofeyer commented Feb 17, 2020

Thank you @Toflar.

@leofeyer leofeyer changed the title Use correct schema.org data for search indexing meta data Use a custom schema for the search indexing meta data Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.