This repository has been archived by the owner on Jun 10, 2024. It is now read-only.

Phantomjs, performance #458

Open
volvofixthis opened this issue Jun 1, 2016 · 20 comments

Comments

@volvofixthis
Contributor

I'm finding it pretty hard to scale the performance of my spider when I need JS rendering. As I understand it, I need to decrease the fetcher's poolsize, right? So I would start 3 phantomjs+fetcher pairs and set poolsize 10 for each of them. Am I right? Maybe you have something to add?

@binux
Owner

binux commented Jun 1, 2016

Deploy multiple phantomjs instances behind a load-balancing frontend, then connect your fetcher to that frontend.

e.g. this docker-compose structure:

phantomjs:
  image: 'binux/pyspider:latest'
  command: phantomjs
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,23333,24444'
  expose:
    - '25555'
  mem_limit: 512m
  restart: always
phantomjs-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - phantomjs
  restart: always

fetcher:
  image: 'binux/pyspider:latest'
  command: '--phantomjs-proxy "phantomjs:80" fetcher'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,25555,23333'
  links:
    - 'phantomjs-lb:phantomjs'
  mem_limit: 128m
  restart: always

@volvofixthis
Contributor Author

The haproxy idea is cool, thank you for sharing. I see in the phantomjs logs that there are very long requests, about 70 seconds, and I don't know why :( Could it be because of my js_script?
function() {
setTimeout(function(){ document.getElementsByClassName("js-realtor-card-phone-trigger")[0].click(); }, 10);
}

And what will happen if the js_script fails for some reason?
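For reference, a script like this is attached per request via self.crawl with fetch_type='js'; a minimal sketch, where the URL and callback are placeholders:

def on_start(self):
    self.crawl(
        'http://example.com/realtor/123',  # placeholder URL
        fetch_type='js',                   # render the page in phantomjs
        js_script='''
        function() {
            setTimeout(function(){
                document.getElementsByClassName("js-realtor-card-phone-trigger")[0].click();
            }, 10);
        }''',
        callback=self.detail_page)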

@binux
Owner

binux commented Jun 1, 2016

In most cases rendering a page is slow; it waits until every resource on the page has loaded.
And there are a lot of reasons a script can fail, including:

  • the element doesn't exist when the script executes
  • the timeout is reached before the page loads, so there is no time to execute the js script
  • the js script executed, but further resources didn't have time to load, or just weren't loaded
  • the js script executed, but the page's own js hadn't loaded, so the page doesn't react as expected

@volvofixthis
Contributor Author

I mean, what happens when such an error occurs? Does the fetch fail, or what? I see that plenty of my fetches are stuck in the active state. I don't care why it occurs; I understand there can be plenty of reasons. Before migrating to pyspider I had phantomjs driven by selenium, which hammered the page with the js script again and again within a defined timeout.

@binux
Owner

binux commented Jun 1, 2016

Nothing happens; it's just as if the script hadn't executed. You should detect this in your script with an assert, and let the task retry.
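A minimal sketch of that pattern, reusing the selector from the js_script above (class and callback names are illustrative):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def detail_page(self, response):
        # If the click never ran, the element is missing; a failing assert
        # marks the task as failed, and the scheduler retries it (up to the
        # task's retries limit, 3 by default).
        assert response.doc('.js-realtor-card-phone-trigger'), 'js_script did not run'
        return {'url': response.url}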

@volvofixthis
Contributor Author

volvofixthis commented Jun 2, 2016

Hello! I tried to set this up with docker-compose, using the following configuration file:

version: '2'
services:
  phantomjs:
    image: binux/pyspider:latest
    command: phantomjs
    cpu_shares: 512
    environment:
      - 'EXCLUDE_PORTS=5000,23333,24444'
    expose:
      - '25555'
    mem_limit: 512m
    restart: always
  phantomjs-lb:
    image: dockercloud/haproxy:latest
    links:
      - phantomjs
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always
  fetcher:
    image: laki9/pyspider:python3
    external_links:
      - 'nedvigka_mysql:mysql'
      - 'nedvigka_redis:redis'
    command: '--config config.json --phantomjs-proxy "phantomjs:80" fetcher --no-xmlrpc'
    working_dir: /home/ubuntu/conf
    cpu_shares: 512
    environment:
      - 'EXCLUDE_PORTS=5000,25555,23333'
    links:
      - 'phantomjs-lb:phantomjs'
    volumes:
      - ./conf:/home/ubuntu/conf
    mem_limit: 128m
    restart: always
  result:
    image: laki9/pyspider:python3
    external_links:
      - 'nedvigka_mysql:mysql'
      - 'nedvigka_redis:redis'
    command: '--config config.json result_worker'
    working_dir: /home/ubuntu/conf
    volumes:
      - ./conf:/home/ubuntu/conf
  processor:
    image: laki9/pyspider:python3
    external_links:
      - 'nedvigka_mysql:mysql'
      - 'nedvigka_redis:redis'
    command: processor
  scheduler:
    image: laki9/pyspider:python3
    external_links:
      - 'nedvigka_mysql:mysql'
      - 'nedvigka_redis:redis'
    command: '--config config.json scheduler'
    working_dir: /home/ubuntu/conf
    volumes:
      - ./conf:/home/ubuntu/conf
  webui:
    image: laki9/pyspider:python3
    external_links:
      - 'nedvigka_mysql:mysql'
      - 'nedvigka_redis:redis'
    links:
      - scheduler
      - 'phantomjs-lb:phantomjs'
    command: '--config config.json --phantomjs-proxy "phantomjs:80" webui'
    working_dir: /home/ubuntu/conf
    ports:
      - "5000:5000"
    volumes:
      - ./conf:/home/ubuntu/conf

The problem is that everything seems to run, but I can't start a task in the dashboard and can't see current progress; everywhere I see a connect-to-scheduler error.
In the webui console I see:
webui_1 | [W 160602 12:39:54 index:106] connect to scheduler rpc error: ConnectionRefusedError(111, 'Connection refused')
webui_1 | [W 160602 12:39:55 task:43] connect to scheduler rpc error: ConnectionRefusedError(111, 'Connection refused')
webui_1 | [W 160602 12:40:10 index:106] connect to scheduler rpc error: ConnectionRefusedError(111, 'Connection refused')

My configuration file looks like this:

{
  "taskdb": "mysql+taskdb://root:ineedmysql@mysql/taskdb",
  "projectdb": "mysql+projectdb://root:ineedmysql@mysql/projectdb",
  "resultdb": "mysql+resultdb://root:ineedmysql@mysql/resultdb",
  "message_queue": "redis://redis:6379/novosti",
  "webui": {
    "username": "root",
    "password": "soyouarehuman",
    "need-auth": true,
    "port": 5000
  },
  "result_worker": {
    "result_cls": "my_result_worker.MyResultWorker"
  }
}

What am I missing?

@volvofixthis
Contributor Author

I ran a little test to ensure the scheduler is accessible from the webui:
user@host:~/nedvigka$ sudo docker exec -it nedvigka_webui_1 bash
root@ee2a19090d65:/home/ubuntu/conf# ping scheduler
PING scheduler (172.18.0.4) 56(84) bytes of data.
64 bytes from nedvigka_scheduler_1.nedvigka_default (172.18.0.4): icmp_seq=1 ttl=64 time=0.120 ms
64 bytes from nedvigka_scheduler_1.nedvigka_default (172.18.0.4): icmp_seq=2 ttl=64 time=0.174 ms
64 bytes from nedvigka_scheduler_1.nedvigka_default (172.18.0.4): icmp_seq=3 ttl=64 time=0.057 ms
64 bytes from nedvigka_scheduler_1.nedvigka_default (172.18.0.4): icmp_seq=4 ttl=64 time=0.057 ms
^C
--- scheduler ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 2999ms
rtt min/avg/max/mdev = 0.057/0.102/0.174/0.048 ms
root@ee2a19090d65:/home/ubuntu/conf# nc 172.18.0.4 23333

root@ee2a19090d65:/home/ubuntu/conf#
root@ee2a19090d65:/home/ubuntu/conf# nc 172.18.0.4 23333

root@ee2a19090d65:/home/ubuntu/conf#

@volvofixthis
Contributor Author

OK, I figured out that I need to use --scheduler-rpc:
https://dpaste.de/5buy
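
For reference, a sketch of the relevant change to the webui service, assuming the scheduler keeps its default xmlrpc port 23333:

webui:
  links:
    - scheduler
    - 'phantomjs-lb:phantomjs'
  command: '--config config.json --phantomjs-proxy "phantomjs:80" webui --scheduler-rpc "http://scheduler:23333/"'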

Currently I don't see any errors, and the dashboard is working well. But now my problem is that the spider just doesn't do anything when I start it. This is what I see in the log:
scheduler_1 | [I 160602 13:52:07 scheduler:424] in 5m: new:0,success:0,retry:0,failed:0
scheduler_1 | [I 160602 13:52:13 scheduler:771] select ya_ru:_on_get_info data:,_on_get_info
fetcher_1 | [I 160602 13:52:13 tornado_fetcher:178] [200] ya_ru:_on_get_info data:,_on_get_info 0s
scheduler_1 | [I 160602 13:52:23 scheduler:628] new task ya_ru:on_start data:,on_start
scheduler_1 | [I 160602 13:52:24 scheduler:771] select ya_ru:on_start data:,on_start
fetcher_1 | [I 160602 13:52:24 tornado_fetcher:178] [200] ya_ru:on_start data:,on_start 0s
scheduler_1 | [I 160602 13:53:07 scheduler:424] in 5m: new:1,success:0,retry:0,failed:0 ya_ru:1,0,0,0
scheduler_1 | [I 160602 13:54:07 scheduler:424] in 5m: new:1,success:0,retry:0,failed:0 ya_ru:1,0,0,0
scheduler_1 | [I 160602 13:55:07 scheduler:424] in 5m: new:1,success:0,retry:0,failed:0 ya_ru:1,0,0,0
scheduler_1 | [I 160602 13:55:17 scheduler:664] restart task ya_ru:on_start data:,on_start
scheduler_1 | [I 160602 13:55:17 scheduler:664] restart task ya_ru:on_start data:,on_start
scheduler_1 | [I 160602 13:56:07 scheduler:424] in 5m: new:1,success:0,retry:0,failed:0 ya_ru:1,0,0,0
scheduler_1 | [I 160602 13:57:07 scheduler:424] in 5m: new:1,success:0,retry:0,failed:0 ya_ru:1,0,0,0
scheduler_1 | [I 160602 13:58:07 scheduler:424] in 5m: new:0,success:0,retry:0,failed:0 ya_ru:0,0,0,0
scheduler_1 | [I 160602 13:59:07 scheduler:424] in 5m: new:0,success:0,retry:0,failed:0 ya_ru:0,0,0,0
scheduler_1 | [I 160602 14:00:07 scheduler:424] in 5m: new:0,success:0,retry:0,failed:0 ya_ru:0,0,0,0

This is how the tasks look in the dashboard:
https://dl.dropboxusercontent.com/u/25725476/screenshots/screenshot-2016.06.02-17%3A01%3A02.png

Source code of the spider:
https://dpaste.de/yHwh

@binux
Owner

binux commented Jun 2, 2016

The processor's command doesn't contain --config config.json.
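
That is, the service should mirror the other workers; it also needs the config volume mounted so --config config.json can be found:

processor:
  image: laki9/pyspider:python3
  external_links:
    - 'nedvigka_mysql:mysql'
    - 'nedvigka_redis:redis'
  command: '--config config.json processor'
  working_dir: /home/ubuntu/conf
  volumes:
    - ./conf:/home/ubuntu/conf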

@volvofixthis
Contributor Author

You know, you are awesome, mate. I was just stuck on this, going crazy trying to set everything up properly. But now it's all working! I don't know how I missed this.

Is there any chance you could crosspost such posts in English too? http://blog.binux.me/2016/05/deployment-of-demopyspiderorg/ I just noticed this post and it looks very useful, but run through a translator it becomes very broken.

Can I support you somehow? I don't have much, but 20 bucks is 20 bucks :) I have PayPal and can pay by credit card directly.

@binux
Owner

binux commented Jun 2, 2016

You can find the clue in the log: the scheduler had selected (dispatched) the task and the fetcher had received it, but nothing came next. And on the dashboard, you should see messages pending between the fetcher and the processor.

Yes, I will translate that post and put it on docs.pyspider.org.
Users are the best support for the project; I have a job that feeds me and haven't been able to spend much time on this one. Thanks.

@volvofixthis
Contributor Author

volvofixthis commented Jun 2, 2016

OK, I got it.

I have tried this configuration in the field. I see that over time phantomjs answers slower and slower, and at the end I see this error:
[E 160602 16:37:51 tornado_fetcher:200] [599] xxxx:e8183d209f66e13461bf0a25de78b868 http://xxxx, ValueError('Expecting value: line 1 column 1 (char 0)',) 50.01s

I tried with --poolsize 10; nothing changed. I am not swapping or anything, I have enough RAM. Phantomjs memory consumption is very high; it easily passes the 1 GB limit. I can't believe it is real.

@binux
Owner

binux commented Jun 2, 2016

Rendering a page is very slow and heavy, and all of the current headless browser implementations have memory leak issues. I want to port js rendering to splash and to one of the electron-based implementations, but that wouldn't solve the problem, just give you another choice.

I restart my phantomjs instances frequently to work around the memory leak issue.
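
A minimal sketch of that workaround for the compose setup above (the schedule and project path are illustrative):

# Host cron entry: recycle the phantomjs containers hourly to reclaim
# leaked memory; expect a brief gap in rendering while they restart.
0 * * * * cd /path/to/compose/project && docker-compose restart phantomjs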

@volvofixthis
Contributor Author

I checked: if I load the link I am interested in, phantomjs consumes around 200 MB of RAM. How hard would it be to write something that handles one request with one instance of phantomjs, after which that phantomjs just dies? I think maybe I can modify phantomjs_fetcher.js to simply exit after a single request, and absorb the resulting errors with haproxy?

I also looked at splash; it is very interesting.
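
A rough sketch of the exit-after-one-request idea; this is not the actual phantomjs_fetcher.js code, just the shape of the change, with illustrative variable names:

// At the end of the web server's request handler, after the result is sent:
response.write(JSON.stringify(result));
response.close();
// Die after a single request; 'restart: always' in compose respawns a
// fresh container, and the haproxy frontend routes around it meanwhile.
phantom.exit(0);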

@binux
Owner

binux commented Jun 2, 2016

It's very easy to kill a render server. We are running splash in production, and the whole instance can easily be killed by certain web pages.

@volvofixthis
Contributor Author

I can't get what you mean. Do you mean that there are pages in the wild which can kill a render server, or what? I want phantomjs to execute only one request, so I can keep possible memory leaks to a minimum.

@binux
Owner

binux commented Jun 2, 2016

I mean it's not easy to implement a "reliable" render server. Yes, executing and re-forking per request can somewhat isolate failures between requests (but forking needs more resources).

But there are some web pages where a single request can kill the render service; you still need to monitor the processes and kill them when needed.

@volvofixthis
Contributor Author

I tried a single request per phantomjs instance. Now I get stable render times and no such noticeable memory leaks. But sometimes I still see 500 errors, which I expected. Without a control service that dispatches queries properly, I don't think there will be error-free behaviour.

I tried splash too; I hammered it with ab for my task, 10 concurrency and 1000 requests, and it failed at 200 requests with a qt error.
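
One way to soften those 500s at the balancer, assuming extra settings can be injected into the haproxy config; note that these retry failed connections (e.g. a backend that just died), not requests that already returned a 500:

defaults
    mode http
    retries 3
    option redispatch
    timeout connect 5s
    timeout server 90s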

@volvofixthis
Contributor Author

Could this error be caused by a phantomjs instance dying due to phantom.exit()?
track.process 0.27ms
self.call() not implemented!
[E 160603 12:46:21 base_handler:195] self.call() not implemented!
Traceback (most recent call last):
  File "/opt/pyspider/pyspider/libs/base_handler.py", line 188, in run_task
    result = self._run_task(task, response)
  File "/opt/pyspider/pyspider/libs/base_handler.py", line 160, in _run_task
    raise NotImplementedError("self.%s() not implemented!" % callback)
NotImplementedError: self.call() not implemented!

{
  "exception": "self.call() not implemented!",
  "follows": 0,
  "logs": "[E 160603 12:46:21 base_handler:195] self.call() not implemented!\n Traceback (most recent call last):\n File \"/opt/pyspider/pyspider/libs/base_handler.py\", line 188, in run_task\n result = self._run_task(task, response)\n File \"/opt/pyspider/pyspider/libs/base_handler.py\", line 160, in _run_task\n raise NotImplementedError(\"self.%s() not implemented!\" % callback)\n NotImplementedError: self.call() not implemented!\n",
  "ok": false,
  "result": null,
  "time": 0.0002689361572265625
}

@binux
Owner

binux commented Jun 4, 2016

I have no idea; I'd need more logs from when it happened.
