How do I crawl a site with a very large number of list pages? #57

Closed
qianbaidu opened this issue Sep 26, 2017 · 2 comments

Comments

@qianbaidu

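I've run into the following problem:

  • The target site has a single list that spans well over 100,000 pages.

  • Problems:

  • The crawled list results are not written to the database, so if the crawl is interrupted partway, all the data is lost. Crawling page by page would mean writing out more than 100,000 list-page URLs, which isn't practical either.

  • Content pages are not crawled until the whole list has been crawled.

  • Approaches I'd like to use:

  1. One thread crawls the list and a separate thread crawls the content pages. However, I can't find an example of distributing tasks, that is, how to push tasks to the scheduling thread. (I'm a beginner and would really appreciate a demo, thanks.)
  2. One thread crawls the list and records it in the database; another thread reads from the database, crawls the content, and marks each item's crawl status. I'm not sure whether this is feasible.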
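For reference, the second approach above can be sketched in plain Go, independent of pholcus. This is only an illustration under stated assumptions: `Store` is a hypothetical in-memory stand-in for a database table, and `fetchList` / `fetchDetail` are placeholder callbacks for the real download-and-parse code. List pages are generated from a page counter and a saved checkpoint, so no list of 100,000 URLs has to be written out, and an interrupted run resumes from the last saved page.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical in-memory stand-in for a database table of
// detail URLs (value true = already crawled) plus a checkpoint of the
// last list page that was saved.
type Store struct {
	mu       sync.Mutex
	done     map[string]bool
	lastPage int
}

func NewStore() *Store { return &Store{done: make(map[string]bool)} }

// SaveListPage records the URLs found on one list page, advances the
// checkpoint, and returns only the URLs that were not already known.
func (s *Store) SaveListPage(page int, urls []string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var fresh []string
	for _, u := range urls {
		if _, seen := s.done[u]; !seen {
			s.done[u] = false
			fresh = append(fresh, u)
		}
	}
	s.lastPage = page
	return fresh
}

func (s *Store) MarkDone(u string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.done[u] = true
}

func (s *Store) LastPage() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastPage
}

// URLs returns every stored URL whose crawled flag equals done.
func (s *Store) URLs(done bool) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []string
	for u, v := range s.done {
		if v == done {
			out = append(out, u)
		}
	}
	return out
}

// Crawl runs one list worker and one detail worker concurrently, resuming
// from the checkpoint. fetchList and fetchDetail are placeholders for the
// real download-and-parse code.
func Crawl(s *Store, lastListPage int, fetchList func(page int) []string, fetchDetail func(url string)) {
	queue := make(chan string, 64)
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // list worker
		defer wg.Done()
		defer close(queue)
		for _, u := range s.URLs(false) { // left over from an interrupted run
			queue <- u
		}
		for p := s.LastPage() + 1; p <= lastListPage; p++ {
			for _, u := range s.SaveListPage(p, fetchList(p)) {
				queue <- u
			}
		}
	}()
	go func() { // detail worker
		defer wg.Done()
		for u := range queue {
			fetchDetail(u)
			s.MarkDone(u)
		}
	}()
	wg.Wait()
}

func main() {
	s := NewStore()
	fetchList := func(page int) []string {
		return []string{fmt.Sprintf("/item/%d-a", page), fmt.Sprintf("/item/%d-b", page)}
	}
	Crawl(s, 3, fetchList, func(string) {})
	fmt.Println(s.LastPage(), len(s.URLs(true)), len(s.URLs(false)))
}
```

In a real crawler the store would be a database table with a status column, and the detail worker could be a pool of goroutines reading from the same channel.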
@qianbaidu
Author

How can the server push detail-page URLs to the scheduler, and how does the client then read those detail-page URLs from the scheduler and crawl them?

@andeya
Owner

andeya commented Oct 9, 2017

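How pholcus executes rules:

  1. Every rule runs concurrently, and you are free to call Output from any rule.
  2. What you describe, "content pages are not crawled until the whole list has been crawled", is only part of the picture. That happens because your rules are written breadth-first. You can also write depth-first crawling logic. This is done with Request.Priority: the higher the value, the higher the priority.
    Depth-first examples:
    https://github.com/henrylee2cn/pholcus_lib/blob/master/taobao/taobao.go#L168
    https://github.com/henrylee2cn/pholcus_lib/blob/master/taobao/taobao.go#L213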
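
To illustrate why a higher priority on detail-page requests gives depth-first behaviour, here is a minimal, self-contained sketch. It is not pholcus code: `Request`, `Scheduler`, and `Simulate` are hypothetical types written for this example, and only the field name `Priority` is taken from the explanation above. Requests with a larger `Priority` are dequeued first, and ties are broken by insertion order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Request is a simplified, hypothetical stand-in for a crawler request.
type Request struct {
	URL      string
	Rule     string // "list" or "detail"
	Page     int
	Priority int // larger value = dequeued earlier
	seq      int // insertion order, used to break ties (FIFO)
}

// queue implements heap.Interface as a max-heap on Priority.
type queue []*Request

func (q queue) Len() int { return len(q) }
func (q queue) Less(i, j int) bool {
	if q[i].Priority != q[j].Priority {
		return q[i].Priority > q[j].Priority
	}
	return q[i].seq < q[j].seq
}
func (q queue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x interface{}) { *q = append(*q, x.(*Request)) }
func (q *queue) Pop() interface{} {
	old := *q
	n := len(old)
	r := old[n-1]
	*q = old[:n-1]
	return r
}

// Scheduler hands out the highest-priority pending request.
type Scheduler struct {
	q   queue
	seq int
}

func (s *Scheduler) Add(r *Request) {
	r.seq = s.seq
	s.seq++
	heap.Push(&s.q, r)
}

func (s *Scheduler) Next() *Request {
	if s.q.Len() == 0 {
		return nil
	}
	return heap.Pop(&s.q).(*Request)
}

// Simulate walks `pages` list pages with `perPage` detail links each and
// returns the order in which the URLs are processed.
func Simulate(pages, perPage int) []string {
	s := &Scheduler{}
	s.Add(&Request{URL: "list/1", Rule: "list", Page: 1, Priority: 0})
	var order []string
	for r := s.Next(); r != nil; r = s.Next() {
		order = append(order, r.URL)
		if r.Rule != "list" {
			continue
		}
		if r.Page < pages { // the next list page keeps the low priority
			next := r.Page + 1
			s.Add(&Request{URL: fmt.Sprintf("list/%d", next), Rule: "list", Page: next, Priority: 0})
		}
		for i := 1; i <= perPage; i++ { // detail pages jump ahead of it
			s.Add(&Request{URL: fmt.Sprintf("detail/%d-%d", r.Page, i), Rule: "detail", Priority: 1})
		}
	}
	return order
}

func main() {
	// Prints: [list/1 detail/1-1 detail/1-2 list/2 detail/2-1 detail/2-2]
	fmt.Println(Simulate(2, 2))
}
```

With both priorities set to the same value, the same simulation would process list/2 before the detail pages of list/1, which is the breadth-first behaviour described in the question. Because rules run concurrently in a real crawler, the actual ordering there is only approximate.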

@andeya andeya closed this as completed Dec 24, 2019
@andeya andeya reopened this Dec 24, 2019
@andeya andeya closed this as completed Dec 24, 2019