
Script doesn't exit when using RedisUrlList #7

Open
simoncpu opened this issue Oct 9, 2017 · 4 comments

Comments

@simoncpu
Contributor

simoncpu commented Oct 9, 2017

Problem: The script doesn't exit when using RedisUrlList.

Steps to replicate:
Run the following code:

'use strict';

const supercrawler = require('supercrawler');
const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org'
        }
    })
});

console.log('Script should exit after this.');

Expected behavior:
The script should exit after running.

Actual behavior:
The script runs indefinitely.

Workaround:
Call process.exit() to terminate the script.
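The workaround can be wrapped in a small helper (the name `exitWhenSettled` is hypothetical, not part of supercrawler) that forces termination once pending work settles, since an open Redis connection otherwise keeps the Node.js event loop alive:

```javascript
'use strict';

// Hypothetical helper sketching the workaround: whether the work
// succeeded or failed, terminate the process explicitly so an open
// ioredis connection cannot keep the event loop running.
function exitWhenSettled(promise, code = 0) {
  promise.then(() => process.exit(code), () => process.exit(1));
}

// Usage sketch (with a real crawler, call crawler.stop() first so
// in-flight requests can finish before the process is killed):
//   crawler.stop();
//   exitWhenSettled(someCleanupPromise);
```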

BTW, I'm using AWS ElastiCache for Redis, just in case this detail is needed. :)

@brendonboshell
Owner

If I understand correctly, this is by design. The process will wait until further URLs are available for crawling. There could be no URLs in the queue for three reasons:

(a) The queue is empty, in which case it waits until a URL is added. In a distributed set-up, this could be added by another script/tool.

(b) The queue only has URLs that are errored and waiting for a retry. Failed URLs are tried using exponential backoff.

(c) Even successful URLs will be recrawled after 30 days (configurable with expiryTimeMs).

Since the crawl will never end, I would expect the process to continue.

You can listen to the urllistempty event to detect when the queue is empty and call crawler.stop(). This should stop the script once the currently-crawled URLs are finished.

@brendonboshell
Owner

On second thoughts, if you have not called start(), it should exit. This is probably because we do not disconnect from redis. I will take a look at this.

@simoncpu
Contributor Author

simoncpu commented Oct 9, 2017

Yepp, I've tried disabling keepAlive in ioredis and calling crawl.stop() without calling crawl.start(), and the script still doesn't exit. Our use case is for a separate script (preferably in AWS Lambda*) to listen for new URLs and push them to supercrawler via crawler.getUrlList().insertIfNotExists().

* I initially assumed that this timed out in Lambda due to Bluebird, but turns out to be caused by Redis.

@simoncpu
Contributor Author

simoncpu commented Oct 9, 2017

Ah... process.exit() has also been recommended by the guys at ioredis. I guess we'll just use this workaround. :)
