
Script doesn't exit when using RedisUrlList #7

Open
simoncpu opened this issue Oct 9, 2017 · 4 comments

Comments

@simoncpu
Contributor

simoncpu commented Oct 9, 2017

Problem: The script doesn't exit when using RedisUrlList.

Steps to replicate:
Run the following code:

'use strict';

const supercrawler = require('supercrawler');
const crawler = new supercrawler.Crawler({
    urlList: new supercrawler.RedisUrlList({
        redis: {
            host: 'redis-server.example.org'
        }
    })
});

console.log('Script should exit after this.');

Expected behavior:
The script should exit after running.

Actual behavior:
The script runs indefinitely.

Workaround:
Call process.exit() to terminate the script.
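The workaround can be wrapped in a small helper (the name `exitWhenSettled` is hypothetical, not part of supercrawler) that forces termination once pending work settles, since an open Redis connection otherwise keeps the Node.js event loop alive:

```javascript
'use strict';

// Hypothetical helper sketching the workaround: whether the work
// succeeded or failed, terminate the process explicitly so an open
// ioredis connection cannot keep the event loop running.
function exitWhenSettled(promise, code = 0) {
  promise.then(() => process.exit(code), () => process.exit(1));
}

// Usage sketch (with a real crawler, call crawler.stop() first so
// in-flight requests can finish before the process is killed):
//   crawler.stop();
//   exitWhenSettled(someCleanupPromise);
```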

BTW, I'm using AWS ElastiCache for Redis, just in case this detail is needed. :)

@brendonboshell
Owner

If I understand correctly, this is by design. The process will wait until further URLs are available for crawling. There could be no URLs in the queue for three reasons:

(a) The queue is empty, in which case it waits until a URL is added. In a distributed set-up, this could be added by another script/tool.

(b) The queue only has URLs that are errored and waiting for a retry. Failed URLs are tried using exponential backoff.

(c) Even successful URLs will be recrawled after 30 days (configurable with expiryTimeMs).

Since the crawl will never end, I would expect the process to continue.

You can listen to the urllistempty event to detect when the queue is empty and call crawler.stop(). This should stop the script once the currently-crawled URLs are finished.

@brendonboshell
Owner

On second thoughts, if you have not called start(), it should exit. This is probably because we do not disconnect from redis. I will take a look at this.

@simoncpu
Contributor Author

simoncpu commented Oct 9, 2017

Yepp, I've tried disabling keepAlive in ioredis and calling crawl.stop() without calling crawl.start(), and the script still doesn't exit. Our use case is for a separate script (preferably in AWS Lambda*) to listen for new URLs and push them to supercrawler via crawler.getUrlList().insertIfNotExists().

* I initially assumed that this timed out in Lambda due to Bluebird, but turns out to be caused by Redis.

@simoncpu
Contributor Author

simoncpu commented Oct 9, 2017

Ah... process.exit() has also been recommended by the guys at ioredis. I guess we'll just use this workaround. :)
