Skip to content

Commit

Permalink
Delete search index entries under more specific conditions (see #7266)
Browse files Browse the repository at this point in the history
Description
-----------

In a Contao instance of a customer where indexing of protected pages is enabled I noticed the following issue: if the URL of protected and already indexed page is requested without a valid login for that page, the URL gets deleted from the search index.

To fix this I first thought of ignoring the `401` and `403` status code in our `SearchIndexListener` for the delete operation. However, then I realized that there are other status codes where the same holds true: a `503` status code is only temporary and any URL responding momentarily with that status code should not be removed from the index (neither would Google, they would only remove the URL from the index if that status code persists over a longer period of time).

Then I realized the same holds true for the `500` status code. Any error happening under a specific URL might only be temporary and thus the URL should not be removed from the index (neither would Google, they would only remove the URL from the index if that status code persists over a longer period of time).

Thus I then decided to completely revamp the conditions under which a URL should be deleted from the index:

* If the status code is `404` or `410`, always delete from the index.
* If the response is succesful and the `X-Robots-Tag` contains `noindex`, always delete from the index.
* If the response is succesful and the HTML contains the meta robots tag with `noindex`, always delete from the index.
* Otherwise never delete from the index automatically, as any other state might just be temporary for any given URL.

Commits
-------

3392dfe revamp SearchIndexListener
018340e update wording
3126738 add comment
9ca90d6 Apply suggestions from code review
74ea635 adjust test and add comment
41d2dc7 remove unused argument
377fe5b Apply suggestions from code review
e8287f5 remove remaining trailing space
  • Loading branch information
fritzmg committed Jun 5, 2024
1 parent 3a73c28 commit 49df784
Show file tree
Hide file tree
Showing 2 changed files with 83 additions and 13 deletions.
44 changes: 40 additions & 4 deletions core-bundle/src/EventListener/SearchIndexListener.php
Original file line number Diff line number Diff line change
Expand Up @@ -59,15 +59,14 @@ public function __invoke(TerminateEvent $event): void

$document = Document::createFromRequestResponse($request, $response);
$needsIndex = $this->needsIndex($request, $response, $document);
$needsDelete = $this->needsDelete($response, $document);

try {
$success = $event->getResponse()->isSuccessful();

if ($needsIndex && $success && $this->enabledFeatures & self::FEATURE_INDEX) {
if ($needsIndex && $this->enabledFeatures & self::FEATURE_INDEX) {
$this->indexer->index($document);
}

if (!$success && $this->enabledFeatures & self::FEATURE_DELETE) {
if ($needsDelete && $this->enabledFeatures & self::FEATURE_DELETE) {
$this->indexer->delete($document);
}
} catch (IndexerException $e) {
Expand All @@ -77,6 +76,11 @@ public function __invoke(TerminateEvent $event): void

private function needsIndex(Request $request, Response $response, Document $document): bool
{
// Do not index if response was not successful
if (!$response->isSuccessful()) {
return false;
}

// Do not index if called by crawler
if (Factory::USER_AGENT === $request->headers->get('User-Agent')) {
return false;
Expand Down Expand Up @@ -108,4 +112,36 @@ private function needsIndex(Request $request, Response $response, Document $docu
// If there are no json ld scripts at all, this should not be handled by our indexer
return 0 !== \count($lds);
}

private function needsDelete(Response $response, Document $document): bool
{
// Always delete on 404 and 410 responses
if (\in_array($response->getStatusCode(), [Response::HTTP_NOT_FOUND, Response::HTTP_GONE], true)) {
return true;
}

// Do not delete if the response was not successful
if (!$response->isSuccessful()) {
return false;
}

// Delete if the X-Robots-Tag header contains "noindex"
if (false !== strpos($response->headers->get('X-Robots-Tag', ''), 'noindex')) {
return true;
}

try {
$robots = $document->getContentCrawler()->filterXPath('//head/meta[@name="robots"]')->first()->attr('content');

// Delete if the meta robots tag contains "noindex"
if (false !== strpos($robots, 'noindex')) {
return true;
}
} catch (\Exception $e) {
// No meta robots tag found
}

// Otherwise do not delete
return false;
}
}
52 changes: 43 additions & 9 deletions core-bundle/tests/EventListener/SearchIndexListenerTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -99,47 +99,81 @@ public function getRequestResponse(): \Generator
false,
];

yield 'Should be deleted because the response was not successful (404)' => [
yield 'Should be deleted because the response was "not found" (404)' => [
Request::create('/foobar'),
new Response('', 404),
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
true,
];

yield 'Should be deleted because the response was not successful (403)' => [
yield 'Should be deleted because the response was "gone" (410)' => [
Request::create('/foobar'),
new Response('', 403),
new Response('', 404),
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
true,
];

yield 'Should not be deleted because even though the response was not successful (403), it was disabled by the feature flag ' => [
yield 'Should not be deleted because even though the response was "not found" (404), it was disabled by the feature flag' => [
Request::create('/foobar'),
new Response('<html><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 403),
new Response('<html><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 404),
SearchIndexListener::FEATURE_INDEX,
false,
false,
];

$response = new Response('<html><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 403);
$response = new Response('<html><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 200);
$response->headers->set('X-Robots-Tag', 'noindex');

yield 'Should not be handled because the X-Robots-Tag header contains "noindex" ' => [
yield 'Should not index but should delete because the X-Robots-Tag header contains "noindex"' => [
Request::create('/foobar'),
$response,
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
true,
];

yield 'Should not be handled because the meta robots tag contains "noindex" ' => [
$response = new Response('<html><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 500);
$response->headers->set('X-Robots-Tag', 'noindex');

yield 'Should not index and delete because the X-Robots-Tag header contains "noindex" and response is unsuccessful' => [
Request::create('/foobar'),
new Response('<html><head><meta name="robots" content="noindex,nofollow"/></head><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 403),
$response,
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
false,
];

yield 'Should not index but should delete because the meta robots tag contains "noindex"' => [
Request::create('/foobar'),
new Response('<html><head><meta name="robots" content="noindex,nofollow"/></head><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 200),
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
true,
];

yield 'Should not index and delete because the meta robots tag contains "noindex" and response is unsuccessful' => [
Request::create('/foobar'),
new Response('<html><head><meta name="robots" content="noindex,nofollow"/></head><body><script type="application/ld+json">{"@context":"https:\/\/contao.org\/","@type":"Page","pageId":2,"noSearch":false,"protected":false,"groups":[],"fePreview":false}</script></body></html>', 500),
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
false,
];

// From the unsuccessful responses only the 404 and 410 status codes should execute a deletion.
for ($status = 400; $status < 600; ++$status) {
if (\in_array($status, [Response::HTTP_NOT_FOUND, Response::HTTP_GONE], true)) {
continue;
}

yield 'Should be skipped because the response status ('.$status.') is not successful and not a "not found" or "gone" response' => [
Request::create('/foobar'),
new Response('', $status),
SearchIndexListener::FEATURE_DELETE | SearchIndexListener::FEATURE_INDEX,
false,
false,
];
}
}
}

0 comments on commit 49df784

Please sign in to comment.