Cache warmup

A PHP library to warm up caches of URLs located in XML sitemaps. Cache warmup is performed by concurrently sending simple HEAD requests to those URLs, either from the command-line or via the provided PHP API. The whole warmup process is highly customizable, e.g. by defining a crawling limit, excluding sitemaps and URLs with exclusion patterns, or selecting a specific crawling strategy. It is even possible to write custom crawlers that handle cache warmup.

🚀 Features

  • Warm up caches of URLs located in XML sitemaps
  • Console command and PHP API for cache warmup
  • Support for sitemap indexes
  • Exclusion patterns for sitemaps and URLs
  • Various crawling strategies
  • Support for gzipped XML sitemaps
  • Interface for custom crawler implementations

🔥 Installation

PHAR (recommended)

Head over to the latest GitHub release and download the cache-warmup.phar file.

Run chmod +x cache-warmup.phar to make it executable.

PHIVE

phive install cache-warmup

Docker

docker run --rm -it eliashaeussler/cache-warmup

You can also use the image from GitHub Container Registry:

docker run --rm -it ghcr.io/eliashaeussler/cache-warmup

Composer

composer require eliashaeussler/cache-warmup

⚡ Usage

Command-line usage

$ cache-warmup [options] [<sitemaps>...]

The following input parameters are available:

Input parameter        Description
sitemaps               URLs or local filenames of XML sitemaps to be warmed up
--allow-failures       Allow failures during URL crawling and exit with zero
--crawler-options, -o  JSON-encoded string of additional config for configurable crawlers
--crawler, -c          FQCN of the crawler to use for cache warmup
--exclude, -e          Patterns of URLs to be excluded from cache warmup
--format, -f           Formatter used to print the cache warmup result
--limit, -l            Limit the number of URLs to be processed
--log-file             File where crawling results are logged
--log-level            Log level used to determine which crawling results to log
--progress, -p         Show a progress bar during cache warmup
--repeat-after         Run cache warmup in an endless loop and repeat x seconds after each run
--stop-on-failure      Cancel further cache warmup requests on failure
--strategy, -s         Optional crawling strategy to prepare URLs before crawling them
--urls, -u             Additional URLs to be warmed up

💡 Run cache-warmup --help to see a detailed explanation of all available input parameters.

sitemaps

URLs or local filenames of XML sitemaps to be warmed up.

$ cache-warmup "https://www.example.org/sitemap.xml" "/var/www/html/sitemap.xml"
Shorthand –
Required ✅ (either sitemaps or --urls must be provided)
Multiple values allowed ✅
Default –

--urls

Additional URLs to be warmed up.

$ cache-warmup -u "https://www.example.org/" -u "https://www.example.org/de/"
Shorthand -u
Required ✅ (either sitemaps or --urls must be provided)
Multiple values allowed ✅
Default –

--exclude

Patterns of URLs to be excluded from cache warmup.

The following patterns can be configured:

  • Regular expressions with delimiter #, e.g. #(no_cache|no_warming)=1#
  • Patterns for use with fnmatch, e.g. *no_cache=1*
$ cache-warmup -e "#(no_cache|no_warming)=1#" -e "*no_cache=1*"
Shorthand -e
Required –
Multiple values allowed ✅
Default –
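
Both pattern styles map to PHP built-ins, so a pattern can be checked locally before passing it to --exclude. A minimal sketch of the matching semantics (whether the library uses exactly these functions internally is an implementation detail):

```php
<?php

// A URL that should be excluded from cache warmup
$url = 'https://www.example.org/?no_cache=1';

// preg_match() covers the "#"-delimited regular expressions
var_dump((bool) preg_match('#(no_cache|no_warming)=1#', $url)); // bool(true)

// fnmatch() covers the shell-style wildcard patterns
var_dump(fnmatch('*no_cache=1*', $url)); // bool(true)
```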

--limit

Limit the number of URLs to be processed.

$ cache-warmup --limit 250
Shorthand -l
Required –
Multiple values allowed –
Default 0 (= no limit)

--progress

Show a progress bar during cache warmup.

Note

You can show a more verbose progress bar by increasing output verbosity with the --verbose command option.

Important

The progress bar is implicitly enabled when using a non-verbose formatter, e.g. json.

$ cache-warmup --progress
Shorthand -p
Required –
Multiple values allowed –
Default no

--crawler

FQCN of the crawler to use for cache warmup.

The crawler must implement one of the following interfaces:

In addition, there exist more interfaces with advanced crawling features:

$ cache-warmup --crawler "Vendor\Crawler\MyCustomCrawler"
Shorthand -c
Required –
Multiple values allowed –
Default – 1)

1) The default crawler depends on whether the --progress command option is set: if it is, the OutputtingCrawler is used, otherwise the ConcurrentCrawler.

--crawler-options

A JSON-encoded string of additional config for configurable crawlers.

Important

These options only apply to crawlers implementing EliasHaeussler\CacheWarmup\Crawler\ConfigurableCrawlerInterface. If the configured crawler does not implement this interface, a warning is shown in case crawler options are configured.

$ cache-warmup --crawler-options '{"concurrency": 3, "request_options": {"delay": 3000}}'
Shorthand -o
Required –
Multiple values allowed –
Default null
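
The JSON string is decoded into the same associative array that configurable crawlers accept in their constructor (see the Crawler configuration section), so both spellings are equivalent:

```php
<?php

// The value passed via --crawler-options is plain JSON ...
$options = json_decode(
    '{"concurrency": 3, "request_options": {"delay": 3000}}',
    true, // decode JSON objects as associative arrays
);

// ... and corresponds to the array form used with the PHP API
assert($options === [
    'concurrency' => 3,
    'request_options' => ['delay' => 3000],
]);
```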

--strategy

Optional crawling strategy to prepare URLs before crawling them.

The following strategies are currently available:

$ cache-warmup --strategy sort-by-priority
Shorthand -s
Required –
Multiple values allowed –
Default – (= crawl by occurrence)

--format

The formatter used to print the cache warmup result. Read more in the Formatters section below.

At the moment, the following formatters are available:

  • json
  • text

$ cache-warmup --format json
Shorthand -f
Required –
Multiple values allowed –
Default text

--log-file

File where crawling results are logged. Setting this option implicitly enables logging.

$ cache-warmup --log-file cache-warmup.log
Shorthand –
Required –
Multiple values allowed –
Default –

--log-level

The log level used to determine which crawling results to log. This option is only respected if the --log-file command option is set.

The following log levels are available:

  • emergency
  • alert
  • critical
  • error (default; enables logging of failed crawls)
  • warning
  • notice
  • info (enables logging of successful crawls)
  • debug
$ cache-warmup --log-level error
Shorthand –
Required –
Multiple values allowed –
Default error (= log only failed crawls)

--allow-failures

Allow failures during URL crawling and exit with zero.

$ cache-warmup --allow-failures
Shorthand –
Required –
Multiple values allowed –
Default no

--stop-on-failure

Cancel further cache warmup requests on failure.

$ cache-warmup --stop-on-failure
Shorthand –
Required –
Multiple values allowed –
Default no

--repeat-after

Run cache warmup in an endless loop and repeat x seconds after each run.

$ cache-warmup --repeat-after 300

Important

If cache warmup fails, the command fails immediately and is not repeated. To continue in case of failures, the --allow-failures command option must be passed as well.

Shorthand –
Required –
Multiple values allowed –
Default 0 (= run only once)

Code usage

// Instantiate and run cache warmer
$cacheWarmer = new \EliasHaeussler\CacheWarmup\CacheWarmer();
$cacheWarmer->addSitemaps('https://www.example.org/sitemap.xml');
$result = $cacheWarmer->run();

// Get successful and failed URLs
$successfulUrls = $result->getSuccessful();
$failedUrls = $result->getFailed();

📂 Configuration

Crawler configuration

Both default crawlers are implemented as configurable crawlers:

  • ConcurrentCrawler
  • OutputtingCrawler

The following configuration can be passed either on command-line as JSON-encoded string (see --crawler-options command option) or as associative array in the constructor when using the library with PHP.

concurrency

Define how many URLs are crawled concurrently.

Note

Internally, Guzzle's Pool feature is used to send multiple requests concurrently using asynchronous requests. You may also have a look at how this is implemented in the library's RequestPoolFactory.

$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler([
    'concurrency' => 3,
]);
Type integer
Default 5

request_method

The HTTP method used to perform cache warmup requests.

$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler([
    'request_method' => 'GET',
]);
Type string
Default HEAD

request_headers

A list of HTTP headers to send with each cache warmup request.

$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler([
    'request_headers' => [
        'X-Foo' => 'bar',
        'User-Agent' => 'Foo-Crawler/1.0',
    ],
]);
Type array<string, string>
Mergeable ✅
Default ['User-Agent' => '<default user-agent>'] 2)

2) The default user-agent is built in ConcurrentCrawlerTrait::getRequestHeaders().
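
Since request_headers is marked as mergeable, configured headers are combined with the defaults rather than replacing them wholesale. Conceptually this behaves like PHP's array_replace(); the sketch below uses a hypothetical default user-agent, not the library's actual merge code:

```php
<?php

// Hypothetical defaults; the real default user-agent comes from
// ConcurrentCrawlerTrait::getRequestHeaders().
$defaultHeaders = ['User-Agent' => 'cache-warmup (hypothetical default)'];

// Headers from the "request_headers" crawler option
$configuredHeaders = [
    'X-Foo' => 'bar',
    'User-Agent' => 'Foo-Crawler/1.0',
];

// Later keys win: the configured user-agent overrides the default,
// while additional headers are kept alongside it.
$merged = array_replace($defaultHeaders, $configuredHeaders);

// $merged now contains 'Foo-Crawler/1.0' as User-Agent plus X-Foo
```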

request_options

Additional request options used for each cache warmup request.

$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler([
    'request_options' => [
        'delay' => 500,
        'timeout' => 10,
    ],
]);
Type array<string, mixed>
Mergeable –
Default []

client_config

Optional configuration used when instantiating a new Guzzle client.

Important

Client configuration is ignored when running cache warmup from the command-line. In addition, it is only respected if a new client is instantiated within the crawler. If an existing client is passed to the crawler, client configuration is ignored.

$stack = \GuzzleHttp\HandlerStack::create();
$stack->push($customMiddleware);

$crawlerConfig = [
    'client_config' => [
        'handler' => $stack,
    ],
];

// ✅ This works as expected
$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler(
    $crawlerConfig,
);

// ❌ This does *not* respect client configuration
$crawler = new \EliasHaeussler\CacheWarmup\Crawler\ConcurrentCrawler(
    $crawlerConfig,
    new \GuzzleHttp\Client(),
);
Type array<string, mixed>
Mergeable –
Default []

Formatters

Text formatter

This is the default formatter that is used if no other formatter is explicitly configured. It writes all user-oriented output to the console.

JSON formatter

This formatter can be used to format user-oriented output as a JSON object. It includes the following properties:

Property           Description
cacheWarmupResult  Lists all crawled URLs, grouped by their crawling state (failure, success, and possibly cancelled)
messages           Contains all logged messages, grouped by message severity (error, info, success, warning)
parserResult       Lists all parsed and excluded XML sitemaps and URLs, grouped by their parsing state (excluded, failure, success)
time               Lists all tracked times during cache warmup (crawl, parse)

💡 The complete JSON structure can be found in the provided JSON schema.
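
An abridged, purely illustrative result object might look as follows (the property names come from the table above, but the nested contents are simplified guesses; the provided JSON schema is authoritative):

```json
{
    "cacheWarmupResult": {
        "failure": [],
        "success": ["https://www.example.org/"]
    },
    "messages": {
        "success": ["Cache warmup finished successfully."]
    },
    "parserResult": {
        "success": ["https://www.example.org/sitemap.xml"]
    },
    "time": {
        "parse": "0.5s",
        "crawl": "1.2s"
    }
}
```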

πŸ§‘β€πŸ’» Contributing

Please have a look at CONTRIBUTING.md.

⭐ License

This project is licensed under GNU General Public License 3.0 (or later).