How to schedule the spider to run a daily job at 1 am, and is there a duplicate content check? #54

Closed
bruceliu2008 opened this issue May 17, 2018 · 2 comments

@bruceliu2008

Hi,
I need to run the spider every day at 1 am, or at some other specific time. Is there a scheduler available for this?

Is there also a duplicate content check? For example, I crawl www.abc.com/aa.html every day and extract the XPath '/html/body/div[3]/div/div[2]/section'. If the content at that XPath is exactly the same as in my last crawl, I want to ignore it.

Thank you.

@zlzforever
Collaborator

zlzforever commented May 17, 2018

  1. On Windows: use Windows Task Scheduler.
  2. On Linux: use cron.
  3. See the code in DotnetSpider.Sample.Program: Startup.Run("-s:MultiSupplementSpider", "-tid:1", "-i:guid");
  4. If you have a lot of machines, you could try DotnetSpider.Enterprise.
  5. There is no duplicate content check. I think you can add a unique column that stores the content's MD5.
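
The MD5 idea in item 5 could look roughly like the sketch below. It is not DotnetSpider code: it uses Python's standard `hashlib` and `sqlite3` modules only to illustrate the technique, and the table name `pages`, its columns, and the helpers `open_store`, `content_md5`, and `save_if_new` are all hypothetical names chosen for this example.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store whose content_md5 column is UNIQUE (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT NOT NULL,"
        " content_md5 TEXT NOT NULL UNIQUE,"
        " content TEXT NOT NULL)"
    )
    return conn


def content_md5(content: str) -> str:
    """Return the hex MD5 digest of the extracted content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def save_if_new(conn: sqlite3.Connection, url: str, content: str) -> bool:
    """Store the content unless an identical digest already exists.

    Returns True when a row was written, False when it was a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO pages (url, content_md5, content) VALUES (?, ?, ?)",
                (url, content_md5(content), content),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a digest that is already stored.
        return False
```

Calling `save_if_new` twice with the same extracted section writes one row and then returns `False`, so an unchanged page from the previous crawl is skipped. Because the constraint here is on the digest alone, identical content found at two different URLs is also treated as a duplicate; a composite unique constraint on URL and digest would be one alternative if that is not wanted.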

@zlzforever
Collaborator

Closed. Reopen it if you need support again.
