Skip to content
This repository has been archived by the owner on Jun 19, 2020. It is now read-only.

refs #803 - Add option to create a new WebPage instance from casper #826

Merged
merged 4 commits into from Feb 6, 2016

Conversation

ilangv
Copy link
Contributor

@ilangv ilangv commented Feb 12, 2014

Currently casperjs does not handle closing a page and creating a new
page object.

If user does a casper.page.close() there is no option to create a new
page object.

With this change we can use a casper.page = casper.newPage()

Currently casperjs does not handle closing a page and creating a new
page object.

If user does a casper.page.close() there is no option to create a new
page object.

With this change we can use a casper.page = casper.newPage()
@n1k0
Copy link
Member

n1k0 commented Feb 12, 2014

Thanks. Could you please add docs and a test please?

@n1k0 n1k0 added this to the 1.1 milestone Feb 12, 2014
@mlb5000
Copy link

mlb5000 commented Feb 12, 2014

Ya, I don't think this is quite all the way there, or I'm possibly using it wrong. I just integrated this into 1.1-beta3 as a test with the code from #803 and get this error:

Error: cannot access member `customHeaders' of deleted QObject

@mlb5000
Copy link

mlb5000 commented Feb 12, 2014

@ilangv Digging a little deeper and looking at casper.start it looks like there is some additional initialization that must happen on the page object after it is created. However, even after bringing everything relevant to page initialization into the newPage function it still doesn't want to cooperate.

@mlb5000
Copy link

mlb5000 commented Feb 12, 2014

@ilangv I take that back, I messed up my test structure (different function prototypes between thenOpen and open), but there are still a couple things required in newPage. I got it to work as I want using:

var links = ['http://www.google.com', 'http://www.yahoo.com'];

var casper = require('casper').create({
    logging: 'error',
    pageSettings: {
        webSecurityEnabled: false
    }
});

casper.start('http://www.amazon.com', function () {
    this.echo(this.getTitle());
});

casper.eachThen(links, function(item) {
    var url = item.data;
    this.open(url).then(function (){
        this.echo(this.getTitle());
    }).then(function () {
        casper.page.close();
        casper.newPage();
    });
});

casper.run(function (){
    this.echo('Done');
    casper.exit();
});

combined with this (obviously a copy/paste, but a simple extract method refactoring should take care of it):

Casper.prototype.newPage = function newPage() {
    "use strict";
    this.checkStarted();
    //copied from casper.start()
    this.page = this.mainPage = createPage(this);
    this.page.settings = utils.mergeObjects(this.page.settings, this.options.pageSettings);
    if (utils.isClipRect(this.options.clipRect)) {
        this.page.clipRect = this.options.clipRect;
    }
    if (utils.isObject(this.options.viewportSize)) {
        this.page.viewportSize = this.options.viewportSize;
    }
    return this.page;
};

@mlb5000
Copy link

mlb5000 commented Feb 12, 2014

Would it also be worth it to check the deleted status of the current WebPage object to either throw an error or call close() on it for the caller? As it stands, calling newPage without first calling close() would result in a memory leak, so we might was well protect from that within casper.

@ilangv
Copy link
Contributor Author

ilangv commented Feb 13, 2014

@mlb5000 👍 Valid point

@ilangv
Copy link
Contributor Author

ilangv commented Feb 13, 2014

I think this should solve it.

Calling a page.close() explicitly should do the job

Casper.prototype.newPage = function newPage() {
        "use strict";
        this.checkStarted();
        //Close the existing page object. Does not harm anything even if close() is called twice
        this.page.close();
        //copied from casper.start()
        this.page = this.mainPage = createPage(this);
        this.page.settings = utils.mergeObjects(this.page.settings, this.options.pageSettings);
        if (utils.isClipRect(this.options.clipRect)) {
        this.page.clipRect = this.options.clipRect;
    }
    if (utils.isObject(this.options.viewportSize)) {
        this.page.viewportSize = this.options.viewportSize;
    }
    return this.page;
};

@ilangv
Copy link
Contributor Author

ilangv commented Feb 13, 2014

@n1k0 Will add tests and update docs shortly. Thanks

@mlb5000
Copy link

mlb5000 commented Feb 14, 2014

Unfortunately something still isn't getting cleaned up. After this patch I'm still getting this error after my eachThen runs for a while, even with newPage called each time around.

QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files
2014-02-13T21:18:06 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe
Abort trap: 6

@mlb5000
Copy link

mlb5000 commented Feb 14, 2014

It's worth noting that without this it grows by about 10MB per page load, and grows much slower with it. It still gets to over 2GB memory usage after a few hundred iterations though...

Made change to attach page settings to the newly created WebPage object
and included unit test and documentation
@n1k0
Copy link
Member

n1k0 commented Feb 14, 2014

Looks like an improvement. How about renaming newPage to reset though?

@ilangv
Copy link
Contributor Author

ilangv commented Feb 14, 2014

@n1k0 casper.reset() sounds misleading. Tends me to think of it as a master reset to the entire casper object rather than just the page. Don't you think so?

@ilangv
Copy link
Contributor Author

ilangv commented Feb 14, 2014

@mlb5000 Don't see this happening in my case. I am currently running the crawler code https://github.com/seethroughtrees/casperjs-spider with the above change and all is well so far. Am I missing something?

@mlb5000
Copy link

mlb5000 commented Feb 14, 2014

@n1k0 Are you talking in the sense of calling start again? Start clears all steps and history as well as resetting the page. This would be counterproductive in my use case, and is why I don't use it.

@ilangv That's a little concerning. I'm calling casper.newPage() as the last step of my eachThen callback and still it's definitely still leaking. What other resources within Phantom, aside from the page, could be leaking?

@n1k0
Copy link
Member

n1k0 commented Feb 14, 2014

What other resources within Phantom, aside from the page, could be leaking?

The casper instance stores a bunch of logs, history, resource objects and so on; it might be interesting to check these and provide a way to purge the data they contain.

@stefan--
Copy link

@mlb5000 any update on this issue? I am facing the same issue... thanks

@cptX
Copy link

cptX commented Oct 6, 2014

Hi to all, I have written several scripts with casperjs to scrap webpages the last couple of years. I'm using casperjs in Linux. The problem with the increased need in resources (CPU, RAM) was always a big pain for all these scripts and imagine some of them are running standalone in a vps without my supervision. To manage the increase in ram I was automatically restarting the scripts through bash scripts.

Yesterday I came across this thread and I was amazed by the idea to close my headless page in order to reallocate resources and keep the RAM consumption stable and not rising.
I tested it yesterday by adding the following code after a casper.thenOpen inside a loop of several iterations (target was 1000 iterations).

casper.then(function() {
casper.page.close();
casper.newPage();
this.wait(2000, function() {
});
});

Initially I tested it with no wait and it failed completely. The program was hanging in casper.newPage() for ever. Then I added a wait of 1000ms and it managed to do some iterations (~5) but again it hanged. Then I increased it to 2000ms and it managed to do 80-100 iterations before hanging. I repeated the test with my pc restarted (so all other resources were relatively free) and the script has hanged in the 270th iteration... After a while (~30secs) by it's own started working again.
During all these iterations the RAM is kept relatively stable and increasing only very slowly. The difference is huge compared to the same code without page.close().

The reasons I'm posting here are the following:

  1. I want to say this is a must and we definitely should find a way to close pages at the end of every iteration.
  2. I see some instability on the above code. I don't understand why we need so much time to wait. 2000ms is too much for 1000 iterations. I could live with it if at least was stable. I don't understand why it hangs every now and then.
  3. I want to ask if it's possible to make the casper.newPage() blocking until the new Page is ready. That way we don't need to put an unreliable wait() for it.

@mlb5000
Copy link

mlb5000 commented Oct 6, 2014

@cptX I do the same as you to manage the ever growing memory consumption; manage an external queue of URLs to process and execute Casper from another program in a loop. Unfortunately my own experiments came to the same unstable conclusion as your own so I just decided not to trust it altogether and continue on with looping from an outside program.

@cptX
Copy link

cptX commented Oct 6, 2014

This promise of closing every page is too good to reject it so soon! I definitely want to stay with it! I saw so much memory impact difference that for sure can change everything in my implementations...
If we focus on finding the problems of the instabilities?
Why is the delay necessary? What happens behind? How can we make it blocking until the new page is ready?

@mickaelandrieu
Copy link
Member

For now I use this tip to reduce RAM and CPU used by my own scripts.
You can use event to send a KILL signal to phantomjs:

casper.on('run.complete', function() {
    casper.die('kill phantom', 1);
});

@mlb5000
Copy link

mlb5000 commented Oct 7, 2014

@mickaelandrieu Thanks for the tip, I'll have to try that. I have noticed that if I'm running in a loop for a long time I'll start to accumulate a lot of orphaned PhantomJS instances (from times when something crashes). Maybe this would fix that.

@cptX
Copy link

cptX commented Oct 7, 2014

@mickaelandrieu Is it possible to use your script in a loop and if yes do we have to start a new phantom instance somehow after that? How?

@mickaelandrieu
Copy link
Member

@cptX can you post your ask in mailing list ?

@cptX
Copy link

cptX commented Oct 8, 2014

Hi mickaelandrieu, as I'm new here I don't understand what you mean to post in the mailing list.

@mickaelandrieu
Copy link
Member

@cptX
Copy link

cptX commented Oct 10, 2014

OK I opened an issue here: https://groups.google.com/forum/#!topic/casperjs/h-KyQWYXFfY

@mickaelandrieu
Copy link
Member

@n1k0 can you take a look at this ? I will try to reviews others waiting PR

@diwu1989
Copy link

I just hit the memory issue on a scraper, would definitely love to see this merged in 👍

});

test.assert(casper.started, 'Casper.start() started');

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this unused lines ?

@mickaelandrieu
Copy link
Member

@n1k0 @paazmaya for me it's 👍

@ssmulders
Copy link

Please add this - it just fixed a major bug for me and would love to have it in the main branch 👯 👍

@istr
Copy link
Collaborator

istr commented Jan 17, 2016

Looks ok to me, too. @n1k0 @paazmaya @mickaelandrieu @hexid any reason to still keep this open?

@paazmaya
Copy link
Contributor

paazmaya commented Feb 6, 2016

Let's do this

paazmaya added a commit that referenced this pull request Feb 6, 2016
refs #803 - Add option to create a new WebPage instance from casper
@paazmaya paazmaya merged commit 8d561c2 into casperjs:master Feb 6, 2016
@ilangv
Copy link
Contributor Author

ilangv commented Feb 6, 2016

At last 👍

@robin-obrien
Copy link

@istr, Any estimation when can it be released, I wish I could start using it as it seems to be critical numerous memory issues

@paazmaya
Copy link
Contributor

Would assume this coming out in the next beta.... couple weeks.
The actual 1.1 hopefully before Summer

@robin-obrien
Copy link

Thanks for the update, I am facing fatal memory issues running on Windows (BTW: On Mac, it is much more stable) and I am very eager having this fix. I would very grateful if you can guide me how to implement the fix in my code. Can I do it in my program or should I download the latest CasperJS code and compile it to a program?

@istr
Copy link
Collaborator

istr commented Mar 1, 2016

I'll try to review the change-set against beta 5 and publish beta 6 next weekend.

@mickaelandrieu
Copy link
Member

@istr and my students will enjoy this new version as well :) Thank you!

@robin-obrien
Copy link

@istr I really appreciate this

@zarkerus
Copy link

zarkerus commented Apr 4, 2016

@ilangv @cptX When I try to use function newPage(), task manager show there are too many TCP Connection and the application got hanged, not sure why but that might be the reason

`casper.start();

var json = require('list.json');
casper.then(function() {
var current = 0
var end = json.length;
for (;current < end;) {
(function(cntr) {
casper.thenOpen(json[cntr].url, function() {
console.log(json[cntr].id);
var data = this.evaluate(function() {
return $("title");
});
}).then(function() {
casper.page.close();
casper.newPage();
});;
})(current);
current++;
}
});
casper.run();`

@istr
Copy link
Collaborator

istr commented Apr 4, 2016

@NghiemDuongHung Please open a new issue with your finding. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet