Improve performance of SessionHandler::getSpiderID()
99f2805 already optimized this to avoid the need to call ->getSpiderID() for
logged-in users, but guest sessions still call ->getSpiderID() on every request
to look up the legacy session.

This commit massively improves the performance of ->getSpiderID() in all
cases, but especially for requests where no spider can be matched. Those
requests previously required a full O(n) search across the spider list and were
thus the worst case, which likely applied to the vast majority of guest
requests. Requests where a spider can be matched benefit as well.

The improvement comes from two changes:

1. The size of the cache that needs to be read and unserialized is reduced from
   87k to 17k.
2. Instead of searching linearly through the list of spiders, which implicitly
   calls ->__get() twice per spider, the matching is performed by an optimized
   regular expression that effectively implements a prefix tree. If the regular
   expression matches, the spiderID is looked up in an array keyed by the
   matched string.
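
The second point can be sketched in isolation. The following is a minimal, standalone illustration and not the code from this commit: `buildFastLookup()`, `findSpiderID()` and the three sample identifiers are hypothetical names and data chosen for the example. It groups identifiers by their first character, builds one alternation per group, and resolves the matched string through an array.

```php
<?php

/**
 * Builds a regex plus an identifier => ID mapping.
 * $spiders is a hypothetical stand-in for the real spider list.
 *
 * @param array<string, int> $spiders lower-case identifier => spider ID
 * @return array{regex: string, mapping: array<string, int>}
 */
function buildFastLookup(array $spiders): array
{
    // Group the remainder of each identifier under its first character.
    $byFirstCharacter = [];
    foreach (\array_keys($spiders) as $identifier) {
        $identifier = (string)$identifier;
        $byFirstCharacter[$identifier[0]][] = \substr($identifier, 1);
    }

    // One non-capturing group per first character: (?:g(?:ooglebot|igabot))
    $alternatives = [];
    foreach ($byFirstCharacter as $char => $suffixes) {
        $quoted = \array_map(
            static fn (string $suffix): string => \preg_quote($suffix, '/'),
            $suffixes
        );
        $alternatives[] = \sprintf(
            '(?:%s(?:%s))',
            \preg_quote((string)$char, '/'),
            \implode('|', $quoted)
        );
    }

    return [
        'regex' => '/' . \implode('|', $alternatives) . '/',
        'mapping' => $spiders,
    ];
}

/**
 * Returns the ID of the identifier found in the user agent, or null.
 */
function findSpiderID(array $lookup, string $userAgent): ?int
{
    if (!\preg_match($lookup['regex'], \strtolower($userAgent), $matches)) {
        return null;
    }

    // $matches[0] is the full matched identifier, which keys the mapping.
    return $lookup['mapping'][$matches[0]];
}

// Sample data, invented for this example.
$lookup = buildFastLookup([
    'googlebot' => 1,
    'gigabot' => 2,
    'bingbot' => 3,
]);
```

With this sample data the generated pattern is `/(?:g(?:ooglebot|igabot))|(?:b(?:ingbot))/`. A user agent whose first character never starts any identifier is rejected without comparing it against every entry, which is the case the linear scan handled worst.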

Numbers for 10,000 calls to ->getSpiderID() on my computer running PHP 8.1:

- Google Bot: From 0.44s down to 0.14s.
- Firefox 98: From 1.05s down to 0.07s.
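
A comparison of this kind could be reproduced with a small harness such as the one below. This is only a sketch under stated assumptions: the 500 synthetic identifiers, `linearLookup()`, `regexLookup()` and `timeCalls()` are all hypothetical, and the regex is a flat alternation rather than the grouped one built by this commit. The absolute timings will therefore not match the numbers above.

```php
<?php

// Hypothetical data: 500 synthetic identifiers such as "cbot002" => 2.
$spiders = [];
for ($i = 1; $i <= 500; $i++) {
    $spiders[\sprintf('%sbot%03d', \chr(97 + $i % 26), $i)] = $i;
}

// Old approach: scan every identifier until one is contained in the UA.
function linearLookup(array $spiders, string $userAgent): ?int
{
    $userAgent = \strtolower($userAgent);
    foreach ($spiders as $identifier => $spiderID) {
        if (\strpos($userAgent, (string)$identifier) !== false) {
            return $spiderID;
        }
    }

    return null;
}

// New approach: one regex match, then an array lookup by matched string.
function regexLookup(string $regex, array $mapping, string $userAgent): ?int
{
    if (!\preg_match($regex, \strtolower($userAgent), $matches)) {
        return null;
    }

    return $mapping[$matches[0]];
}

// Flat alternation for brevity (an assumption of this sketch).
$regex = '/' . \implode('|', \array_map(
    static fn ($identifier): string => \preg_quote((string)$identifier, '/'),
    \array_keys($spiders)
)) . '/';

// Runs $fn $iterations times and returns the elapsed seconds.
function timeCalls(callable $fn, int $iterations = 10000): float
{
    $start = \hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return (\hrtime(true) - $start) / 1e9;
}

// A user agent that matches no identifier, i.e. the former worst case.
$firefox = 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0';

$linearSeconds = timeCalls(static fn () => linearLookup($spiders, $firefox));
$regexSeconds = timeCalls(static fn () => regexLookup($regex, $spiders, $firefox));

\printf("linear: %.3fs, regex: %.3fs\n", $linearSeconds, $regexSeconds);
```

The harness leaves out cache reading and unserialization, so it only isolates the matching step.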
TimWolla committed Mar 1, 2022
1 parent d2f8e72 commit 96dced2
Showing 2 changed files with 36 additions and 6 deletions.
@@ -23,6 +23,38 @@ public function rebuild(array $parameters)
         $spiderList->sqlOrderBy = "spider.spiderID ASC";
         $spiderList->readObjects();

+        if (isset($parameters['fastLookup'])) {
+            $firstCharacter = [];
+            $mapping = [];
+            foreach ($spiderList as $spider) {
+                if (!isset($firstCharacter[$spider->spiderIdentifier[0]])) {
+                    $firstCharacter[$spider->spiderIdentifier[0]] = [];
+                }
+                $firstCharacter[$spider->spiderIdentifier[0]][] = \substr($spider->spiderIdentifier, 1);
+
+                $mapping[$spider->spiderIdentifier] = $spider->spiderID;
+            }
+
+            $regex = '';
+            foreach ($firstCharacter as $char => $spiders) {
+                if ($regex !== '') {
+                    $regex .= '|';
+                }
+                $regex .= \sprintf(
+                    '(?:%s(?:%s))',
+                    \preg_quote($char, '/'),
+                    \implode('|', \array_map(static function ($identifier) {
+                        return \preg_quote($identifier, '/');
+                    }, $spiders))
+                );
+            }
+
+            return [
+                'regex' => "/{$regex}/",
+                'mapping' => $mapping,
+            ];
+        }
+
         return $spiderList->getObjects();
     }
 }

@@ -1410,16 +1410,14 @@ public static function resetSessions(array $userIDs = [])
      */
     protected function getSpiderID(string $userAgent): ?int
     {
-        $spiderList = SpiderCacheBuilder::getInstance()->getData();
+        $data = SpiderCacheBuilder::getInstance()->getData(['fastLookup' => true]);
         $userAgent = \strtolower($userAgent);

-        foreach ($spiderList as $spider) {
-            if (\strpos($userAgent, $spider->spiderIdentifier) !== false) {
-                return \intval($spider->spiderID);
-            }
+        if (!\preg_match($data['regex'], $userAgent, $matches)) {
+            return null;
         }

-        return null;
+        return $data['mapping'][$matches[0]];
     }

     /**