Implement a search indexer abstraction #730

Merged
leofeyer merged 22 commits into master from feature/search-indexer-abstraction
Oct 23, 2019

Member

@Toflar Toflar commented Sep 6, 2019

This PR implements a search indexer abstraction layer so that additional search indexers can be registered. The core indexer can also be disabled completely with the following configuration:

contao:
    search:
        default_indexer:
            enabled: false

Here are some key concepts:

  • There's a general, simple IndexerInterface now. I've implemented a DelegatingIndexer that just forwards to all registered indexers so we can have multiple indexers.
  • A Document represents a URI, response status code, headers and the body. For metadata I chose to use application/ld+json scripts because they are designed for exactly this use case (also see schema.org).
  • I've removed $GLOBALS['TL_NOINDEX_KEYS'] because it is just plain nonsense. If you don't want a page to be indexed when these parameters are set, you have to configure the page to have a <meta name="robots" content="noindex"> tag. Otherwise neither a real search engine nor my planned indexer has any chance of finding out what you want. Also, why would you not want to index pages with a page parameter present? That may be just fine in some cases.
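
To illustrate the delegating idea, here is a minimal, self-contained sketch. The class names follow the description above, but the method names (`index()`, `addIndexer()`, `getUri()`) and the constructor shape are assumptions made for illustration, not the signatures introduced by this PR. `RecordingIndexer` is purely hypothetical.

```php
<?php

// Sketch only: names mirror the PR description, but all signatures are assumptions.
final class Document
{
    private string $uri;
    private int $statusCode;
    private array $headers;
    private string $body;

    public function __construct(string $uri, int $statusCode, array $headers, string $body)
    {
        $this->uri = $uri;
        $this->statusCode = $statusCode;
        $this->headers = $headers;
        $this->body = $body;
    }

    public function getUri(): string
    {
        return $this->uri;
    }
}

interface IndexerInterface
{
    public function index(Document $document): void;
}

// Forwards every document to all registered indexers, in registration order.
final class DelegatingIndexer implements IndexerInterface
{
    /** @var IndexerInterface[] */
    private array $indexers = [];

    public function addIndexer(IndexerInterface $indexer): void
    {
        $this->indexers[] = $indexer;
    }

    public function index(Document $document): void
    {
        foreach ($this->indexers as $indexer) {
            $indexer->index($document);
        }
    }
}

// Hypothetical indexer that only records the URIs it receives.
final class RecordingIndexer implements IndexerInterface
{
    public array $seen = [];

    public function index(Document $document): void
    {
        $this->seen[] = $document->getUri();
    }
}

$first = new RecordingIndexer();
$second = new RecordingIndexer();

$delegating = new DelegatingIndexer();
$delegating->addIndexer($first);
$delegating->addIndexer($second);
$delegating->index(new Document('https://example.com/news', 200, [], '<html></html>'));
```

Callers only ever talk to the delegating indexer, so adding or disabling an individual indexer does not change the calling code.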

All unit tests etc. are already done. So this PR is in a final state to be reviewed 😊

@leofeyer leofeyer added this to the 4.9 milestone Sep 6, 2019
Member

@leofeyer leofeyer left a comment

Looks pretty good overall.

I guess TL_NOINDEX_KEYS exists to prevent flooding the search index with duplicate or irrelevant search entries. We should preserve the functionality, although I agree that we should generate a noindex tag in this case.

Member

@ausi ausi left a comment

I think dropping TL_NOINDEX_KEYS could lead to problems. Calendar parameters like day, month and year in particular could result in a large number of duplicate entries for the same page.

Once we have a better way to detect such “duplicates” we can remove TL_NOINDEX_KEYS IMO.

core-bundle/src/Resources/contao/library/Contao/Config.php (outdated, resolved)
core-bundle/src/Resources/contao/pages/PageRegular.php (outdated, resolved)
core-bundle/src/Search/Document.php (resolved)
@Toflar
Member Author

Toflar commented Sep 9, 2019

All comments addressed. Ready for another round of reviews.
I've restored the TL_NOINDEX_KEYS feature in a7af57e, although the hardcoded page_ comparison hurt my eyes, so I had to replace it with a regular expression :)
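
For illustration, the kind of check described here could look roughly like the sketch below. The function name `hasNoIndexKey()` and the exact pattern are assumptions, not the code from a7af57e, and the key list in the usage lines is only an example built from the parameters mentioned in this thread.

```php
<?php

/**
 * Returns true if any query parameter name is a configured "no index" key
 * or starts with the "page_" prefix (hypothetical helper, sketch only).
 *
 * @param string[] $queryKeys   Names of the query parameters of the request
 * @param string[] $noIndexKeys Parameter names that should prevent indexing
 */
function hasNoIndexKey(array $queryKeys, array $noIndexKeys): bool
{
    // Quote each configured key so that it is matched literally.
    $alternatives = array_map(
        static fn (string $key): string => preg_quote($key, '/'),
        $noIndexKeys
    );

    // One pattern also covers every "page_" prefixed parameter,
    // instead of a separate hardcoded string comparison.
    $alternatives[] = 'page_.+';

    $pattern = '/^(?:'.implode('|', $alternatives).')$/';

    return [] !== preg_grep($pattern, $queryKeys);
}

$noIndexKeys = ['day', 'month', 'year', 'page'];

$calendar = hasNoIndexKey(['month', 'category'], $noIndexKeys); // true
$pagination = hasNoIndexKey(['page_n12'], $noIndexKeys);        // true
$plain = hasNoIndexKey(['category'], $noIndexKeys);             // false
```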

Member

@ausi ausi left a comment

LGTM 🎉

core-bundle/src/Resources/contao/classes/Frontend.php (outdated, resolved)
core-bundle/src/Resources/contao/pages/PageRegular.php (outdated, resolved)
@Toflar Toflar changed the title [RFC] Implemented search indexer abstraction [RTM] Implemented search indexer abstraction Sep 10, 2019
@Toflar
Member Author

Toflar commented Sep 10, 2019

Apart from a rebase to master once Symfony deps are raised, this is RTM 🎉

@Toflar Toflar force-pushed the feature/search-indexer-abstraction branch from 8f8c3de to aa70f92 Compare October 16, 2019 13:56
@Toflar
Member Author

Toflar commented Oct 16, 2019

Merged the latest master into this PR and adjusted the configuration section accordingly. Should be all ready to merge now :)

Member

@leofeyer leofeyer left a comment

Very good job! Only the service IDs seem a little inconsistent to me (see my comments).

core-bundle/src/DependencyInjection/Configuration.php (outdated, resolved)
core-bundle/src/Resources/config/services.yml (outdated, resolved)
core-bundle/src/Resources/config/services.yml (outdated, resolved)
@leofeyer
Member

Do we really need the Indexer sub-namespace?

namespace Contao\CoreBundle\Search\Indexer;

class DefaultIndexer
{
}

Will there be a large number of different indexer classes? And what other sub-namespaces will we have in the future?

@Toflar
Member Author

Toflar commented Oct 23, 2019

I don't know. Maybe an ElasticSearchIndexer. An AlgoliaSearchIndexer?

@leofeyer leofeyer force-pushed the feature/search-indexer-abstraction branch from dcc0864 to fba751c Compare October 23, 2019 14:54
@leofeyer leofeyer force-pushed the feature/search-indexer-abstraction branch from 1b0269f to 1d73d45 Compare October 23, 2019 16:05
@leofeyer leofeyer merged commit 27fe686 into master Oct 23, 2019
@leofeyer
Member

Thank you very much @Toflar.

@leofeyer leofeyer deleted the feature/search-indexer-abstraction branch October 23, 2019 16:07
leofeyer pushed a commit that referenced this pull request Nov 8, 2019
Description
-----------

This is a follow-up PR for #730.
I've introduced a search indexer abstraction there, but clearing a single URI was not part of it.
I've extended the interface, which now also contains a `clearDocument()` method (no BC break, 4.9 is still in development), and I've extracted the logic into its own listener so that it works not just for our own `PageError404` but for any exception.

Commits
-------

44fd9aa Delete invalid URLs using the new search indexer abstraction
550141c Fixed comment
a2eb4d7 Only handle responses that contained any JSON LD data
1195a3f Fix the coding style
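
The commit "Only handle responses that contained any JSON LD data" depends on detecting `application/ld+json` scripts in a response body. A rough sketch of such a check follows. `extractJsonLd()` is a hypothetical helper, the regular expression is a simplification rather than the parsing used in the actual commits, and the sample markup is made up.

```php
<?php

/**
 * Collects the decoded contents of all application/ld+json scripts
 * found in an HTML body (hypothetical helper, sketch only).
 *
 * @return array[] One decoded array per valid JSON-LD script
 */
function extractJsonLd(string $html): array
{
    // Simplified matching: assumes a quoted type attribute and no nested script tags.
    $pattern = '~<script\b[^>]*\btype=(["\'])application/ld\+json\1[^>]*>(.*?)</script>~is';

    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }

    $documents = [];

    foreach ($matches[2] as $json) {
        $data = json_decode(trim($json), true);

        // Skip scripts whose content is not valid JSON.
        if (is_array($data)) {
            $documents[] = $data;
        }
    }

    return $documents;
}

$html = <<<'HTML'
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "WebPage", "name": "News"}</script>
</head><body>Hello</body></html>
HTML;

$withJsonLd = extractJsonLd($html);
$withoutJsonLd = extractJsonLd('<html><body>No metadata here</body></html>');

// A listener could return early when nothing was found.
$shouldHandle = [] !== $withJsonLd;
```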
@leofeyer leofeyer changed the title [RTM] Implemented search indexer abstraction Implement a search indexer abstraction Dec 23, 2019