Crawler stops between 95 and 98%. #1356

Closed
christianbarkowsky opened this issue Feb 19, 2020 · 25 comments
@christianbarkowsky
Contributor

christianbarkowsky commented Feb 19, 2020

Affected version(s)

4.9

Description

On my website the new crawler stops between 95 and 98%.
The following error message can be found in the log files.

[2020-02-19 09:02:23] request.CRITICAL: Uncaught PHP Exception InvalidArgumentException: "Unable to parse URI: http://" at /www/htdocs/vendor/nyholm/psr7/src/Uri.php line 51 {"exception":"[object] (InvalidArgumentException(code: 0): Unable to parse URI: http:// at /www/htdocs//vendor/nyholm/psr7/src/Uri.php:51)"} []

How to reproduce

I can show you my installation.

@Toflar
Member

Toflar commented Feb 19, 2020

Can you paste the complete stack trace?

@christianbarkowsky
Contributor Author

christianbarkowsky commented Feb 19, 2020

Uncaught PHP Exception InvalidArgumentException: "Unable to parse URI: http://" at /www/htdocs/vendor/nyholm/psr7/src/Uri.php line 51

[▼
  "exception" => InvalidArgumentException {#2097 ▼
    #message: "Unable to parse URI: http://"
    #code: 0
    #file: "/www/htdocs/vendor/nyholm/psr7/src/Uri.php"
    #line: 51
    trace: {▼
    /www/htdocs/vendor/nyholm/psr7/src/Uri.php:51 {▶}
      /www/htdocs/vendor/terminal42/escargot/src/Subscriber/HtmlCrawlerSubscriber.php:55 {▼
        Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber->onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void …
        › $link = new Link($node, (string) $crawlUri->getUri()->withPath('')->withQuery('')->withFragment(''));
        › $uri = new Uri($link->getUri());
        › 
        arguments: {▶}
      }
      /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:449 {▼
        Terminal42\Escargot\Escargot->processResponseChunk(ResponseInterface $response, ChunkInterface $chunk): void …
        › if (SubscriberInterface::DECISION_NEGATIVE !== $needsContentDecision) {
        ›     $subscriber->onLastChunk($crawlUri, $response, $chunk);
        › }
        arguments: {▼
          $crawlUri: Terminal42\Escargot\CrawlUri {#829 …}
          $response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
          $chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
        }
      }
      /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:407 {▼
        Terminal42\Escargot\Escargot->processResponses(array $responses): void …
        › foreach ($this->getClient()->stream($responses) as $response => $chunk) {
        ›     $this->processResponseChunk($response, $chunk);
        › }
        arguments: {▼
          $response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
          $chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
        }
      }
      /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:315 {▼
        Terminal42\Escargot\Escargot->crawl(): void …
        › 
        ›     $this->processResponses($responses);
        › }
        arguments: {▼
          $responses: [ …5]
        }
      }
      /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Crawl.php:171 {▼
        Contao\Crawl->run() …
        › // Start crawling
        › $escargot->crawl();
        › 
      }
      /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/modules/ModuleMaintenance.php:49 {▼
        Contao\ModuleMaintenance->compile() …
        › 
        › $buffer = $this->$callback->run();
        › 
      }
      /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/BackendModule.php:92 {▼
        Contao\BackendModule->generate() …
        › $this->Template = new BackendTemplate($this->strTemplate);
        › $this->compile();
        › 
      }
      /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Backend.php:434 {▼
        Contao\Backend->getBackendModule($module, PickerInterface $picker = null) …
        › 
        ›     $this->Template->main .= $objCallback->generate();
        › }
      }
      /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/controllers/BackendMain.php:155 {▼
        Contao\BackendMain->run() …
        › 
        › $this->Template->main .= $this->getBackendModule(Input::get('do'), $picker);
        › $this->Template->title = $this->Template->headline;
        arguments: {▼
          $module: "maintenance"
          $picker: null
        }
      }
      /www/htdocs/vendor/contao/core-bundle/src/Controller/BackendController.php:48 {▼
        Contao\CoreBundle\Controller\BackendController->mainAction(): Response …
        › 
        ›     return $controller->run();
        › }
      }
      /www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:146 {▼
        Symfony\Component\HttpKernel\HttpKernel->handleRaw(Request $request, int $type = self::MASTER_REQUEST): Response …
        › // call controller
        › $response = $controller(...$arguments);
        › 
      }
      /www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:68 {▼
        Symfony\Component\HttpKernel\HttpKernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
        › try {
        ›     return $this->handleRaw($request, $type);
        › } catch (\Exception $e) {
        arguments: {▼
          $request: Symfony\Component\HttpFoundation\Request {#6 …}
          $type: 1
        }
      }
      /www/htdocs/vendor/symfony/http-kernel/Kernel.php:201 {▼
        Symfony\Component\HttpKernel\Kernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
        › try {
        ›     return $this->getHttpKernel()->handle($request, $type, $catch);
        › } finally {
        arguments: {▼
          $request: Symfony\Component\HttpFoundation\Request {#6 …}
          $type: 1
          $catch: true
        }
      }
      /www/htdocs/web/index.php:31 {▼
        require …
        › 
        › $response = $kernel->handle($request);
        › $response->send();
        arguments: {▶}
      }
      /www/htdocs/web/app.php:4 {▼
        › // Backwards compatibility
        › require __DIR__.'/index.php';
        › 
        arguments: {▼
          "/www/htdocs/web/index.php"
        }
      }
    }
  }
]

{▼
/www/htdocs/vendor/nyholm/psr7/src/Uri.php:51 {▼
    Nyholm\Psr7\Uri->__construct(string $uri = '') …
    › if (false === $parts = \parse_url($uri)) {
        ›     throw new \InvalidArgumentException("Unable to parse URI: $uri");
    › }
  }
  /www/htdocs/vendor/terminal42/escargot/src/Subscriber/HtmlCrawlerSubscriber.php:55 {▼
    Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber->onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void …
    › $link = new Link($node, (string) $crawlUri->getUri()->withPath('')->withQuery('')->withFragment(''));
    › $uri = new Uri($link->getUri());
    › 
    arguments: {▼
      $uri: "http://"
    }
  }
  /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:449 {▼
    Terminal42\Escargot\Escargot->processResponseChunk(ResponseInterface $response, ChunkInterface $chunk): void …
    › if (SubscriberInterface::DECISION_NEGATIVE !== $needsContentDecision) {
    ›     $subscriber->onLastChunk($crawlUri, $response, $chunk);
    › }
    arguments: {▼
      $crawlUri: Terminal42\Escargot\CrawlUri {#829 …}
      $response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
      $chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
    }
  }
  /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:407 {▼
    Terminal42\Escargot\Escargot->processResponses(array $responses): void …
    › foreach ($this->getClient()->stream($responses) as $response => $chunk) {
    ›     $this->processResponseChunk($response, $chunk);
    › }
    arguments: {▼
      $response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
      $chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
    }
  }
  /www/htdocs/vendor/terminal42/escargot/src/Escargot.php:315 {▼
    Terminal42\Escargot\Escargot->crawl(): void …
    › 
    ›     $this->processResponses($responses);
    › }
    arguments: {▼
      $responses: [ …5]
    }
  }
  /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Crawl.php:171 {▼
    Contao\Crawl->run() …
    › // Start crawling
    › $escargot->crawl();
    › 
  }
  /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/modules/ModuleMaintenance.php:49 {▼
    Contao\ModuleMaintenance->compile() …
    › 
    › $buffer = $this->$callback->run();
    › 
  }
  /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/BackendModule.php:92 {▼
    Contao\BackendModule->generate() …
    › $this->Template = new BackendTemplate($this->strTemplate);
    › $this->compile();
    › 
  }
  /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Backend.php:434 {▼
    Contao\Backend->getBackendModule($module, PickerInterface $picker = null) …
    › 
    ›     $this->Template->main .= $objCallback->generate();
    › }
  }
  /www/htdocs/vendor/contao/core-bundle/src/Resources/contao/controllers/BackendMain.php:155 {▼
    Contao\BackendMain->run() …
    › 
    › $this->Template->main .= $this->getBackendModule(Input::get('do'), $picker);
    › $this->Template->title = $this->Template->headline;
    arguments: {▼
      $module: "maintenance"
      $picker: null
    }
  }
  /www/htdocs/vendor/contao/core-bundle/src/Controller/BackendController.php:48 {▼
    Contao\CoreBundle\Controller\BackendController->mainAction(): Response …
    › 
    ›     return $controller->run();
    › }
  }
  /www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:146 {▼
    Symfony\Component\HttpKernel\HttpKernel->handleRaw(Request $request, int $type = self::MASTER_REQUEST): Response …
    › // call controller
    › $response = $controller(...$arguments);
    › 
  }
  /www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:68 {▼
    Symfony\Component\HttpKernel\HttpKernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
    › try {
    ›     return $this->handleRaw($request, $type);
    › } catch (\Exception $e) {
    arguments: {▼
      $request: Symfony\Component\HttpFoundation\Request {#6 …}
      $type: 1
    }
  }
  /www/htdocs/vendor/symfony/http-kernel/Kernel.php:201 {▼
    Symfony\Component\HttpKernel\Kernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
    › try {
    ›     return $this->getHttpKernel()->handle($request, $type, $catch);
    › } finally {
    arguments: {▼
      $request: Symfony\Component\HttpFoundation\Request {#6 …}
      $type: 1
      $catch: true
    }
  }
  /www/htdocs/web/index.php:31 {▼
    require …
    › 
    › $response = $kernel->handle($request);
    › $response->send();
    arguments: {▼
      $request: Symfony\Component\HttpFoundation\Request {#6 …}
    }
  }
  /www/htdocs/web/app.php:4 {▶}

@Toflar
Member

Toflar commented Feb 20, 2020

You have an empty <a href="http://"></a> link somewhere on your page, which is invalid. Can you update all dependencies so that terminal42/escargot is updated to 0.5.2? It should be fixed with terminal42/escargot@3b16bad. The debug log should then also tell you where that link was found and that it couldn't be added to the queue :)
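
For illustration, the failure can be reproduced without the crawler: PHP's parse_url() returns false for http://, and that is the condition that makes the Uri constructor in the trace throw. The following is only a minimal sketch. The helper isParsableUri() and the sample hrefs are assumptions made up for this example; they are not taken from Escargot or nyholm/psr7 and do not show the actual fix in 0.5.2.

```php
<?php

/**
 * Hypothetical helper (not part of Escargot): reports whether PHP can
 * parse the given href at all.
 */
function isParsableUri(string $href): bool
{
    // parse_url() returns false for seriously malformed URLs such as "http://".
    return false !== parse_url($href);
}

// Made-up hrefs, as they might be collected from a page.
$hrefs = ['https://example.com/', 'http://', '/relative/path'];

// Keep only the links that can safely be turned into URI objects.
$queueable = array_values(array_filter($hrefs, 'isParsableUri'));

print_r($queueable);
```

With a check like this, the empty http:// link would be dropped before it reaches the URI constructor, while absolute and relative links pass through unchanged.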

@Toflar Toflar added this to the 4.9 milestone Feb 20, 2020
@Toflar Toflar added the bug label Feb 20, 2020
@Toflar Toflar self-assigned this Feb 20, 2020
@christianbarkowsky
Contributor Author

After the update, a new error appears.

[2020-02-20 17:19:55] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occurred while executing 'SELECT uri, level, processed, found_on, tags FROM tl_crawl_queue WHERE (job_id = ?) AND (processed = ?) ORDER BY id ASC LIMIT 1 OFFSET 508' with params ["c423a14e-233d-48a6-b291-155429e27422", 0]: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away" at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php line 106 {"exception":"[object] (Doctrine\\DBAL\\Exception\\DriverException(code: 0): An exception occurred while executing 'SELECT uri, level, processed, found_on, tags FROM tl_crawl_queue WHERE (job_id = ?) AND (processed = ?) ORDER BY id ASC LIMIT 1 OFFSET 508' with params [\"c423a14e-233d-48a6-b291-155429e27422\", 0]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php:106, Doctrine\\DBAL\\Driver\\PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:123, PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:121)"} []

@Toflar
Member

Toflar commented Feb 20, 2020

> General error: 2006 MySQL server has gone away"

...

@christianbarkowsky
Contributor Author

Where would the MySQL server have gone? ... The site is running ... The crawler keeps spinning at 96%. 🎃

@Toflar
Member

Toflar commented Feb 20, 2020

I can't help you any further here without a copy of the whole setup, sorry. The MySQL server shuts down, and you have to debug on your own why that happens. Maybe there is some endless loop, maybe not.

@christianbarkowsky
Contributor Author

On the command line, the crawler runs without problems.

@christianbarkowsky
Contributor Author

On the command line, the crawler crawls web-freelancer-gesucht.de, but I never link to that site.

[Screenshot, 2020-02-20 18:01:03]

@christianbarkowsky
Contributor Author

Now I have found the link. 💡
It is attached to the author's name in a blog comment.

https://brkwsky.de/blog-leser/diese-drei-tools-erleichtern-unseren-arbeitsalltag

<p class="info">Kommentar von <a href="http://www.web-freelancer-gesucht.de" target="_blank" rel="nofollow noreferrer noopener">Mark</a

nofollow noreferrer noopener?!

@Toflar
Member

Toflar commented Feb 24, 2020

Links with rel="nofollow" should be ignored. Check the debug logs.

@LIVID-Media

I have a similar problem. Only links whose attribute is exactly rel="nofollow" are correctly tagged with "rel-nofollow" in the table "tl_crawl_queue". With rel="nofollow noopener", it doesn't work.

@Toflar
Member

Toflar commented Feb 24, 2020

Indeed. Fixed in terminal42/escargot@f01decb and released as 0.5.3. Update your dependencies so you get the latest terminal42/escargot version and this issue should be gone.
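
The thread does not show the code of that fix, so the following is only a sketch of the general idea: the rel attribute is a whitespace-separated list of tokens, and comparing the whole attribute value against "nofollow" would miss values such as "nofollow noopener". The helper relContains() is a hypothetical name chosen for this example and is not Escargot's API.

```php
<?php

/**
 * Hypothetical helper: checks whether a rel attribute value contains the
 * given link type, treating the value as a whitespace-separated token list.
 */
function relContains(string $rel, string $token): bool
{
    // Split on any run of whitespace and ignore empty pieces.
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);

    // Link types are compared case-insensitively, hence the lowercasing.
    return in_array(strtolower($token), $tokens, true);
}

var_dump(relContains('nofollow', 'nofollow'));                     // bool(true)
var_dump(relContains('nofollow noreferrer noopener', 'nofollow')); // bool(true)
var_dump(relContains('noopener', 'nofollow'));                     // bool(false)
```

The second call corresponds to the comment link quoted earlier in this thread, which carries rel="nofollow noreferrer noopener".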

@LIVID-Media

Great... thanks for the fast fix!

@christianbarkowsky
Contributor Author

Sorry. 0.5.3 does not fix my problem. :(

[Screenshot, 2020-02-24 15:35:18]

@Toflar
Member

Toflar commented Feb 24, 2020

I fixed the problem @LIVID-Media mentioned. I cannot fix your problem until I have proper instructions on how to reproduce the issue.

@christianbarkowsky
Contributor Author

I can share my screen, @Toflar?

@Toflar
Member

Toflar commented Feb 25, 2020

I couldn't spot any issues. The crawler finishes correctly and crawls through all the data.
The progress bar advances normally. The title (in your screenshot http://www.web-freelancer-gesucht.de/) is simply not updated for every request. It is only updated for requests that were fully completed, which is not the case for broken link checks.
So the title may not change for quite a while.
I think we should remove it completely, as it seems to confuse people and provides no added value anyway.

@christianbarkowsky
Contributor Author

Ok, that's a good idea.

@Toflar
Member

Toflar commented Feb 25, 2020

PR is here: #1396
I've also found an issue and released Escargot 0.5.5, which should immensely reduce the number of URLs that are checked and speed things up quite a lot 😄
Please provide feedback.

@christianbarkowsky
Contributor Author

Tadaaaa, now it works and the crawler was lightning fast. ❤️

@Toflar Toflar closed this as completed Feb 25, 2020
leofeyer pushed a commit that referenced this issue Feb 26, 2020
…see #1396)

Description
-----------

It was confusing and it does not provide any added value.
Also, it just slows down CLI rendering.

Also see #1356.

Commits
-------

74a2e10 Do not show the current URI in progress bar title on crawl command
leofeyer pushed a commit to contao/core-bundle that referenced this issue Feb 26, 2020
…see #1396)

Description
-----------

It was confusing and it does not provide any added value.
Also, it just slows down CLI rendering.

Also see contao/contao#1356.

Commits
-------

74a2e108 Do not show the current URI in progress bar title on crawl command
@rorych
Contributor

rorych commented Apr 7, 2020

@Toflar Could it be that the newest version treats console commands as GET requests that have no URI, because we are on the console? Since I updated from Contao 4.9.1 to 4.9.2 with all dependencies, I have been getting a similar error on all cron job commands.

One example command would be:
vendor/terminal42/notification_center/bin/queue -s 2 -n 10

Error:
Fatal error: Uncaught InvalidArgumentException: Unable to parse URI: http://:/ in /home/linderim/public_html/contao49/vendor/nyholm/psr7/src/Uri.php:51 Stack trace: #0 /home/linderim/public_html/contao49/vendor/contao/core-bundle/src/Search/Document.php(133): Nyholm\Psr7\Uri->__construct('http://:/') #1 /home/linderim/public_html/contao49/vendor/contao/core-bundle/src/EventListener/SearchIndexListener.php(68): Contao\CoreBundle\Search\Document::createFromRequestResponse(Object(Symfony\Component\HttpFoundation\Request), Object(Symfony\Component\HttpFoundation\Response)) #2 /home/linderim/public_html/contao49/vendor/symfony/event-dispatcher/EventDispatcher.php(304): Contao\CoreBundle\EventListener\SearchIndexListener->__invoke(Object(Symfony\Component\HttpKernel\Event\TerminateEvent), 'kernel.terminat...', Object(Symfony\Component\EventDispatcher\EventDispatcher)) #3 /home/linderim/public_html/contao49/vendor/symfony/event-dispatcher/EventDispatcher.php(264): Symfony\Component\EventDispatcher\EventDispatcher::Symfony\Compon in /home/linderim/public_html/contao49/vendor/nyholm/psr7/src/Uri.php on line 51

As a hotfix, I added this:
if ($request->getUri() == 'http://:/') { return; }
to /vendor/contao/core-bundle/src/EventListener/SearchIndexListener.php, just before the call to createFromRequestResponse() on line 67.
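
The hotfix above only matches the one literal string http://:/. A slightly more general sketch, under stated assumptions, would skip any URI that cannot be parsed. Both function names below are hypothetical: parseUriOrFail() is a stand-in that mimics the parse_url() check visible in the stack trace and is not the real Nyholm\Psr7\Uri class, and tryParseUri() is an illustrative guard, not Contao code.

```php
<?php

/**
 * Stand-in for the URI constructor seen in the stack trace: it throws
 * when parse_url() rejects the input.
 */
function parseUriOrFail(string $uri): array
{
    $parts = parse_url($uri);

    if (false === $parts) {
        throw new InvalidArgumentException("Unable to parse URI: $uri");
    }

    return $parts;
}

/**
 * Hypothetical guard: returns null for an unparsable URI instead of
 * letting the exception escape as a fatal error.
 */
function tryParseUri(string $uri): ?array
{
    try {
        return parseUriOrFail($uri);
    } catch (InvalidArgumentException $e) {
        return null;
    }
}

// On the console there is no host, which yields a URI like this one.
var_dump(tryParseUri('http://:/'));

// A regular web request URI is parsed as usual.
var_dump(tryParseUri('https://example.com/page'));
```

A caller could then return early whenever the guard yields null, which covers http://:/ as well as other malformed URIs.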

Is this a local problem on my site, or is it caused by a logic error in the SearchIndexListener?
I could not find an empty href attribute in the database.

Thanks in advance for your help

@Toflar
Member

Toflar commented Apr 7, 2020

No, that doesn't look like a crawler issue. It looks like the BC layer of the initialize.php always expects a web request, which is not the case for the notification center (or probably many other scripts). /cc @aschempp

@aschempp
Member

aschempp commented Apr 7, 2020

See #1637

@ricowe

ricowe commented Aug 25, 2020

Is there an update or fix for this? Every time I update Contao, I get this error every minute (via cron) until I insert the hotfix by @rorych.

PHP Fatal error: Uncaught InvalidArgumentException: Unable to parse URI: http://:/ in /var/www/vhosts/mydomain/httpdocs/vendor/nyholm/psr7/src/Uri.php:53 Stack trace: #0 /var/www/vhosts/mydomain/httpdocs/vendor/contao/core-bundle/src/Search/Document.php(133): Nyholm\Psr7\Uri->__construct() #1 /var/www/vhosts/mydomain/httpdocs/vendor/contao/core-bundle/src/EventListener/SearchIndexListener.php(74): Contao\CoreBundle\Search\Document::createFromRequestResponse() #2 /var/www/vhosts/mydomain/httpdocs/vendor/symfony/event-dispatcher/EventDispatcher.php(304): Contao\CoreBundle\EventListener\SearchIndexListener->__invoke() #3 /var/www/vhosts/mydomain/httpdocs/vendor/symfony/event-dispatcher/EventDispatcher.php(264): Symfony\Component\EventDispatcher\EventDispatcher::Symfony\Component\EventDispatcher\{closure}() #4 /var/www/vhosts/mydomain/httpdocs/vendor/symfony/event-dispatcher/EventDispatcher.php(239): Symfony\Component\EventDispatcher\EventDispatch in /var/www/vhosts/mydomain/httpdocs/vendor/nyholm/psr7/src/Uri.php on line 53

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 18, 2024