New outputKey() and keepInputData() step methods #64

Merged
merged 2 commits into from
Dec 21, 2022
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -10,12 +10,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 * New functionality to paginate: There is the new `Paginate` child class of the `Http` step class (easy access via `Http::get()->paginate()`). It takes an instance of the `PaginatorInterface` and uses it to iterate through pagination links. There is one implementation of that interface, the `SimpleWebsitePaginator`. The `Http::get()->paginate()` method uses it by default when called with just a CSS selector to get pagination links. Paginators receive all loaded pages and implement the logic to find pagination links. The paginator class is also called before sending a request, with the request object that is about to be sent as an argument (`prepareRequest()`). This way, it should even be possible to implement more complex pagination functionality, for example when pagination is built using POST requests with query strings in the request body.
 * New methods `stopOnErrorResponse()` and `yieldErrorResponses()` that can be used with `Http` steps. By calling `stopOnErrorResponse()`, the step will throw a `LoadingException` when a response has a 4xx or 5xx status code. By calling `yieldErrorResponses()`, even error responses will be yielded and passed on to the next steps (this was the default behaviour until this version; see the breaking change below).
 * The body of HTTP responses with a `Content-Type` header containing `application/x-gzip` is automatically decoded when `Http::getBodyString()` is used. Therefore, `ext-zlib` was added to `suggest` in `composer.json`.
+* New methods `outputKey()` and `keepInputData()` that can be used with any step. With `outputKey()`, the step converts non-array output to an array and uses the key provided as the method's argument as the array key for the output value. The `keepInputData()` method allows you to forward data from the step's input to the output. If the input is not an array, you can define a key using the method's argument. This is useful, for example, if the initial inputs contain data that you also want to add to the final crawling results.
 * The `FileCache` class can now compress the cache data to save disk space. Use the `useCompression()` method to do so.
 * New method `retryCachedErrorResponses()` in `HttpLoader`. When called, the loader will only use successful responses (status code < 400) from the cache and therefore retry already cached error responses.
 * New method `writeOnlyCache()` in `HttpLoader` to only write to, but not read from, the response cache. Can be used to renew cached responses.
 * `Filter::urlPathMatches()` to filter URL paths using a regex.
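
To illustrate the `outputKey()` entry in the list above, here is a minimal standalone sketch of the wrapping rule. It mirrors the check this PR adds to `Step::invokeAndYield()`, but the function name `wrapWithOutputKey()` is hypothetical and exists only for this example; it is not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not library code): wraps non-array output in an
 * array under the given key and leaves array output untouched.
 * Like the check in the diff, a null or empty key means "do not wrap".
 */
function wrapWithOutputKey(mixed $output, ?string $outputKey): mixed
{
    if (!is_array($output) && $outputKey) {
        return [$outputKey => $output];
    }

    return $output;
}

// A scalar output becomes a one-element array.
$wrapped = wrapWithOutputKey('Some headline', 'title');

// Array output is passed through as it is.
$untouched = wrapWithOutputKey(['title' => 'Some headline'], 'title');
```

Wrapping matters mostly in combination with `keepInputData()`, because input data can only be added to array output.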

 ### Changed
+* __BREAKING__: Group steps can now only produce combined outputs, as previously done when the `combineToSingleOutput()` method was called. The method has been removed.
 * __BREAKING__: Error responses (4xx as well as 5xx), by default, won't produce any step outputs any longer. If you want to receive error responses, use the new `yieldErrorResponses()` method.
 * __BREAKING__: Removed the `httpClient()` method in the `HttpCrawler` class. If you want to provide your own HTTP client, implement a custom `loader` method passing your client to the `HttpLoader` instead.
 * In case of a 429 (Too Many Requests) response, the `HttpLoader` now automatically waits and retries. By default, it retries twice and waits 10 seconds for the first retry and a minute for the second one. If the response also contains a `Retry-After` header with a value in seconds, it complies with that. Exception: by default it waits at most `60` seconds (you can set your own limit if you want); if the `Retry-After` value is higher, it stops crawling. If all the retries also receive a `429`, it also throws an exception.
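
The retry rules in the 429 entry above can be sketched as a small decision function. This is only an illustration under stated assumptions: `secondsToWaitBeforeRetry()`, its parameters, and the use of `RuntimeException` are hypothetical, and the actual `HttpLoader` code is not part of this diff.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not library code): returns how many seconds to wait
 * before retry number $retry (0-based) after a 429 response.
 *
 * @param int[] $defaultWaits wait times used when no Retry-After is given
 */
function secondsToWaitBeforeRetry(
    int $retry,
    ?int $retryAfter = null,
    array $defaultWaits = [10, 60],
    int $maxWait = 60,
): int {
    // No wait time left means every allowed retry was already used.
    if (!isset($defaultWaits[$retry])) {
        throw new RuntimeException('All retries were answered with 429.');
    }

    if ($retryAfter !== null) {
        // A Retry-After value above the limit stops the crawl.
        if ($retryAfter > $maxWait) {
            throw new RuntimeException('Retry-After exceeds the maximum wait time.');
        }

        return $retryAfter;
    }

    return $defaultWaits[$retry];
}

$firstWait = secondsToWaitBeforeRetry(0);          // default for the first retry
$secondWait = secondsToWaitBeforeRetry(1);         // default for the second retry
$serverDefined = secondsToWaitBeforeRetry(0, 30);  // Retry-After: 30
```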
16 changes: 8 additions & 8 deletions src/Crawler.php
@@ -205,21 +205,21 @@ private function invokeStepsRecursive(Input $input, StepInterface $step, int $st
     {
         $outputs = $step->invokeStep($input);

-        if ($step->cascades() && $this->nextStep($stepIndex)) {
+        $nextStep = $this->nextStep($stepIndex);
+
+        if ($step->cascades() && $nextStep) {
             foreach ($outputs as $output) {
                 if ($this->monitorMemoryUsage !== false) {
                     $this->logMemoryUsage();
                 }

                 $this->outputHook?->call($this, $output, $stepIndex, $step);

-                if ($this->nextStep($stepIndex)) {
-                    yield from $this->invokeStepsRecursive(
-                        new Input($output),
-                        $this->nextStep($stepIndex),
-                        $stepIndex + 1
-                    );
-                }
+                yield from $this->invokeStepsRecursive(
+                    new Input($output),
+                    $nextStep,
+                    $stepIndex + 1
+                );
             }
         } elseif ($step->cascades()) {
             if ($this->outputHook) {
56 changes: 56 additions & 0 deletions src/Steps/BaseStep.php
@@ -52,6 +52,12 @@ abstract class BaseStep implements StepInterface
      */
     protected array $filters = [];

+    protected bool $keepInputData = false;
+
+    protected ?string $keepInputDataKey = null;
+
+    protected ?string $outputKey = null;
+
     /**
      * @param Input $input
      * @return Generator<Output>
@@ -174,6 +180,26 @@ final public function orWhere(string|FilterInterface $keyOrFilter, ?FilterInterf
         return $this;
     }

+    public function outputKey(string $key): static
+    {
+        $this->outputKey = $key;
+
+        return $this;
+    }
+
+    /**
+     * @param string|null $inputKey
+     * @return $this
+     */
+    public function keepInputData(?string $inputKey = null): static
+    {
+        $this->keepInputData = true;
+
+        $this->keepInputDataKey = $inputKey;
+
+        return $this;
+    }
+
     public function resetAfterRun(): void
     {
         $this->uniqueOutputKeys = $this->uniqueInputKeys = [];
@@ -258,6 +284,36 @@ final protected function passesAllFilters(mixed $output): bool
         return true;
     }

+    /**
+     * @return array<string, mixed>
+     * @throws Exception
+     */
+    protected function addInputDataToOutputData(mixed $inputValue, mixed $outputValue): array
+    {
+        if (!is_array($outputValue)) {
+            throw new Exception(
+                'Can\'t add input data to non array output data! You can use the outputKey() method ' .
+                'to make the step\'s output an array.'
+            );
+        }
+
+        if (!is_array($inputValue)) {
+            if (!is_string($this->keepInputDataKey)) {
+                throw new Exception('No key defined for scalar input value.');
+            }
+
+            $inputValue = [$this->keepInputDataKey => $inputValue];
+        }
+
+        foreach ($inputValue as $key => $value) {
+            if (!isset($outputValue[$key])) {
+                $outputValue[$key] = $value;
+            }
+        }
+
+        return $outputValue;
+    }
+
     /**
      * @param mixed[] $output
      */
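
The merge rule of `addInputDataToOutputData()` in this file can be tried outside the class with the standalone sketch below. The function name `mergeInputIntoOutput()`, the explicit `$inputKey` parameter, and the use of `InvalidArgumentException` are assumptions made for this example. It shows that output values take precedence over input values, and that, because of `isset()`, an output key holding `null` is treated like a missing key.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical standalone version of the merge rule (not library code).
 *
 * @return array<int|string, mixed>
 */
function mergeInputIntoOutput(mixed $input, mixed $output, ?string $inputKey = null): array
{
    if (!is_array($output)) {
        throw new InvalidArgumentException('Output must be an array; set an output key first.');
    }

    if (!is_array($input)) {
        if ($inputKey === null) {
            throw new InvalidArgumentException('No key defined for scalar input value.');
        }

        $input = [$inputKey => $input];
    }

    foreach ($input as $key => $value) {
        // isset() is false for missing keys and for keys holding null,
        // so a null output value is replaced by the input value.
        if (!isset($output[$key])) {
            $output[$key] = $value;
        }
    }

    return $output;
}

$merged = mergeInputIntoOutput(
    ['id' => 7, 'title' => 'from input'],
    ['title' => 'from output', 'price' => null, 'id' => null],
);
// 'title' keeps the output value, 'id' is filled from the input,
// 'price' stays null because the input has no such key.
```

If `null` output values should be preserved, `array_key_exists()` would be the check to use instead of `isset()`; whether that is desirable here is a design question for the maintainers.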
68 changes: 20 additions & 48 deletions src/Steps/Group.php
@@ -20,8 +20,6 @@ final class Group extends BaseStep

     private ?LoaderInterface $loader = null;

-    private bool $combine = false;
-
     /**
      * @param Input $input
      * @return Generator<Output>
@@ -43,7 +41,7 @@ public function invokeStep(Input $input): Generator
                     $input = $step->callUpdateInputUsingOutput($input, $output);
                 }

-                if ($this->combine && $step->cascades()) {
+                if ($this->cascades() && $step->cascades()) {
                     $stepKey = $step->getResultKey() ?? $key;

                     $combinedOutput = $this->addOutputToCombinedOutputs(
@@ -52,52 +50,13 @@ public function invokeStep(Input $input): Generator
                         $stepKey,
                         $nthOutput,
                     );
-                } elseif ($this->cascades() && $step->cascades()) {
-                    if ($this->uniqueOutput !== false && !$this->inputOrOutputIsUnique($output)) {
-                        continue;
-                    }
-
-                    if ($this->passesAllFilters($output)) {
-                        yield $output;
-                    }
                 }
             }
         }

-        if ($this->combine && $this->cascades()) {
-            yield from $this->prepareCombinedOutputs($combinedOutput, $input->result);
-        }
-    }
-
-    public function combineToSingleOutput(): self
-    {
-        $this->combine = true;
-
-        return $this;
-    }
-
-    /**
-     * @throws Exception
-     */
-    public function setResultKey(string $key): static
-    {
-        if (!$this->combine) {
-            throw new Exception('Groups can only add data to results when output is combined to a single output.');
-        }
-
-        return parent::setResultKey($key);
-    }
-
-    /**
-     * @throws Exception
-     */
-    public function addKeysToResult(?array $keys = null): static
-    {
-        if (!$this->combine) {
-            throw new Exception('Groups can only add data to results when output is combined to a single output.');
+        if ($this->cascades()) {
+            yield from $this->prepareCombinedOutputs($combinedOutput, $input);
         }
-
-        return parent::addKeysToResult($keys);
     }

     public function addsToOrCreatesResult(): bool
@@ -187,7 +146,7 @@ private function prepareInput(Input $input): ?Input
      */
     private function addResultToInputIfAnyResultKeysDefined(Input $input): Input
     {
-        if ($this->combine && $this->addsToOrCreatesResult() && !$input->result) {
+        if ($this->addsToOrCreatesResult() && !$input->result) {
             $input = new Input($input->get(), new Result());
         }

@@ -221,18 +180,31 @@ private function addOutputToCombinedOutputs(

     /**
      * @param mixed[] $combinedOutputs
-     * @param Result|null $result
+     * @param Input $input
      * @return Generator<Output>
+     * @throws Exception
      */
-    private function prepareCombinedOutputs(array $combinedOutputs, ?Result $result = null): Generator
+    private function prepareCombinedOutputs(array $combinedOutputs, Input $input): Generator
     {
+        $result = $input->result;
+
         foreach ($combinedOutputs as $combinedOutput) {
             $outputData = $this->normalizeCombinedOutputs($combinedOutput);

             if ($this->passesAllFilters($outputData)) {
+                if ($this->keepInputData === true) {
+                    $outputData = $this->addInputDataToOutputData($input->get(), $outputData);
+                }
+
+                $output = new Output($outputData, $result);
+
+                if ($this->uniqueOutput !== false && !$this->inputOrOutputIsUnique($output)) {
+                    continue;
+                }
+
                 $this->addOutputDataToResult($outputData, $result);

-                yield new Output($outputData, $result);
+                yield $output;
             }
         }
     }
2 changes: 1 addition & 1 deletion src/Steps/Loading/Http.php
@@ -146,7 +146,7 @@ public function yieldErrorResponses(): static
     /**
      * @throws InvalidArgumentException
      */
-    protected function validateAndSanitizeInput(mixed $input): UriInterface
+    protected function validateAndSanitizeInput(mixed $input): mixed
     {
         return $this->validateAndSanitizeToUriInterface($input);
     }
14 changes: 14 additions & 0 deletions src/Steps/Loop.php
@@ -186,6 +186,20 @@ public function orWhere(string|FilterInterface $keyOrFilter, ?FilterInterface $f
         return $this;
     }

+    public function outputKey(string $key): static
+    {
+        $this->step->outputKey($key);
+
+        return $this;
+    }
+
+    public function keepInputData(?string $inputKey = null): static
+    {
+        $this->step->keepInputData($inputKey);
+
+        return $this;
+    }
+
     /**
      * Callback that is called in a step group to adapt the input for further steps
      *
8 changes: 8 additions & 0 deletions src/Steps/Step.php
@@ -179,6 +179,14 @@ private function invokeAndYield(mixed $validInputValue, ?Result $result): Genera
                 continue;
             }

+            if (!is_array($output) && $this->outputKey) {
+                $output = [$this->outputKey => $output];
+            }
+
+            if ($this->keepInputData === true) {
+                $output = $this->addInputDataToOutputData($validInputValue, $output);
+            }
+
             $output = $this->output($output, $result);

             if ($this->uniqueOutput && !$this->inputOrOutputIsUnique($output)) {
4 changes: 4 additions & 0 deletions src/Steps/StepInterface.php
@@ -43,5 +43,9 @@ public function where(string|FilterInterface $keyOrFilter, ?FilterInterface $fil

     public function orWhere(string|FilterInterface $keyOrFilter, ?FilterInterface $filter = null): static;

+    public function outputKey(string $key): static;
+
+    public function keepInputData(?string $inputKey = null): static;
+
     public function resetAfterRun(): void;
 }