Entry URL added with addUrl is filtered by the Scheduler, so the spider cannot start on the next run #501

Closed
JsonSong89 opened this issue Mar 22, 2017 · 9 comments

Comments

@JsonSong89

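I added an entry URL with Spider.create(HahaProcessor()).addUrl("http://www.haha.mx/rising")
and set FileCacheQueueScheduler as the Scheduler.
After one crawl, this URL is marked as already crawled.
(screenshot)
In practice I fetch the hottest few posts from this page every half hour, so I want this entry URL never to be filtered, while the URLs of the other content pages are still deduplicated as usual. How should I do this?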

@JsonSong89
Author

JsonSong89 commented Mar 22, 2017

Or is there some way to flag that the entry URL http://www.haha.mx/rising should not be recorded by the Scheduler?

@JsonSong89
Author

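This part looks odd:
(screenshot)
It seems that setting CYCLE_TRIED_TIMES on the request would let the entry URL bypass the duplicate filter,
but in Spider: addUrls -> addRequest(new Request(url)) -> scheduler.push(request, this); the request has already been pushed at that point, so there is no chance to set the flag.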
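
The flow described above can be modelled in plain Java. This is a simplified stand-alone sketch, not WebMagic's source: `MiniRequest`, `MiniScheduler`, and the `RESERVED_KEY` constant are hypothetical stand-ins for `Request`, the duplicate-removing scheduler, and `CYCLE_TRIED_TIMES`. It assumes, as the comment above suggests, that the scheduler skips the duplicate check whenever that extra is present on a request.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for a crawl request: a URL plus free-form extras. */
class MiniRequest {
    final String url;
    final Map<String, Object> extras = new HashMap<>();

    MiniRequest(String url) {
        this.url = url;
    }

    /** Attaches an extra and returns this request so calls can be chained. */
    MiniRequest putExtra(String key, Object value) {
        extras.put(key, value);
        return this;
    }
}

/** Hypothetical scheduler: deduplicates by URL unless the request is flagged. */
class MiniScheduler {
    // Assumed key name; stands in for Request.CYCLE_TRIED_TIMES.
    static final String RESERVED_KEY = "_cycle_tried_times";

    private final Set<String> seen = new HashSet<>();
    private final Queue<MiniRequest> queue = new ArrayDeque<>();

    /** Returns true if the request was queued, false if it was filtered out. */
    boolean push(MiniRequest request) {
        boolean reserved = request.extras.get(RESERVED_KEY) != null;
        // Flagged requests skip the duplicate check; others are queued only once.
        if (reserved || seen.add(request.url)) {
            queue.add(request);
            return true;
        }
        return false;
    }

    /** Number of requests currently waiting in the queue. */
    int size() {
        return queue.size();
    }
}
```

In this model the flag has to be on the request before `push` is called, so the caller must build the request itself rather than hand over a bare URL string. In WebMagic that would presumably mean passing a prepared `Request` to `Spider.addRequest` instead of calling `addUrl`; whether the flag has other side effects (for example on retry handling) should be checked against the version in use.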

@code4craft
Owner

scheduler.getDuplicatedRemover().resetDuplicateCheck() lets all pages be added again.

@JsonSong89
Author

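@code4craft Thanks for the reply.
I took a look, and this seems to clear all URLs at once, whereas I only want to clear the single entry URL. It looks like I will have to write my own scheduler.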
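
The difference between a full reset and clearing a single URL can be sketched with a small in-memory set. This is an illustration only: `SeenUrls`, `resetAll`, and `forget` are hypothetical names, not part of WebMagic, and a file-backed scheduler would presumably also need to drop the URL from whatever it persists to disk.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory record of URLs that have already been scheduled. */
class SeenUrls {
    private final Set<String> urls = ConcurrentHashMap.newKeySet();

    /** Returns true if the URL was seen before; otherwise records it. */
    boolean isDuplicate(String url) {
        return !urls.add(url);
    }

    /** Full reset: every URL becomes eligible again. */
    void resetAll() {
        urls.clear();
    }

    /** Targeted reset: only the given URL becomes eligible again. */
    void forget(String url) {
        urls.remove(url);
    }
}
```

Calling `forget` on the entry URL before each scheduled run would let that page be crawled again while content pages stay deduplicated, which is the behavior asked for here.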

@TGhoul

TGhoul commented Dec 13, 2017

Hi, how did you solve this in the end? I have run into the same problem. Thanks.

@JsonSong89
Author

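@TGhoul I just wrote my own. I copied the existing scheduler and changed one thing:
a whitelist of URLs can be passed in at construction time, and the duplicate check lets the URLs in that whitelist through.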

@TGhoul

TGhoul commented Dec 13, 2017

Could you post a demo for me to look at? Thanks.

@JsonSong89
Author

```java
public class FileCacheFilterInitUrlScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler, Closeable {

    // (screenshot in the original post; fields and constructor not shown here)

    private void initDuplicateRemover() {
        setDuplicateRemover(
                new DuplicateRemover() {
                    @Override
                    public boolean isDuplicate(Request request, Task task) {
                        if (!inited.get()) {
                            init(task);
                        }
                        // Change: the initial (whitelisted) URLs are never treated as duplicates
                        if (filterUrls.contains(request.getUrl())) {
                            return false;
                        }

                        return !urls.add(request.getUrl());
                    }

                    // ... rest unchanged from the copied scheduler
```

Add a filterUrls property, set it at construction time, and add the check to the duplicate test. That's all.
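
For readers without the full scheduler source at hand, the whitelist idea can be tried in isolation. The following is a minimal stand-alone sketch in plain Java, not the class from the post: `WhitelistDuplicateFilter` is a hypothetical name, it works on URL strings rather than WebMagic `Request` objects, and it leaves out the file caching entirely.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical duplicate filter that never filters a fixed set of entry URLs. */
class WhitelistDuplicateFilter {
    // Entry URLs that must always pass (named after the field in the post).
    private final Set<String> filterUrls;
    // All other URLs seen so far.
    private final Set<String> urls = ConcurrentHashMap.newKeySet();

    WhitelistDuplicateFilter(Collection<String> filterUrls) {
        // Defensive copy so later changes by the caller do not affect the filter.
        this.filterUrls = new HashSet<>(filterUrls);
    }

    /** Returns true only for non-whitelisted URLs that were seen before. */
    boolean isDuplicate(String url) {
        if (filterUrls.contains(url)) {
            return false; // whitelisted entry URLs are never duplicates
        }
        return !urls.add(url);
    }
}
```

Because whitelisted URLs are never recorded, the entry page passes on every run, while each content URL is accepted once and rejected afterwards.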

@TGhoul

TGhoul commented Dec 13, 2017

Thank you very much!

@sutra sutra added this to the WebMagic-0.7.4 milestone Oct 26, 2020