
Phantom crawls #3

Closed
kbraak opened this issue Jun 9, 2017 · 6 comments

kbraak commented Jun 9, 2017

Below is a detailed description of a phantom crawl attempt, i.e. one that has no start date and never comes to an end:

Mammals in MZNA-VERT: project CAS was registered on Mar 22, 2016 and republished on Apr 14, 2016, according to its version history. Its DwC-A endpoint is confirmed working.

GBIF automatically attempts to re-crawl each dataset every 7 days. Unfortunately, this particular dataset has only been crawled 4 times since it was registered on Mar 22, 2016:

  • Crawl attempt #1 ran Mar 22, 2016, finishing successfully
  • Crawl attempt #2 ran Mar 22, 2016, finishing successfully but not modified
  • Crawl attempt #3 ran Mar 29, 2016, finishing successfully but not modified
  • Crawl attempt #4 is currently running…

Crawl attempt #4 is a phantom crawl: it has no start date and has supposedly been running ever since the first week of April 2016.

The consequence of this problem is that GBIF failed to re-index the latest changes made to the dataset when it was republished on Apr 14, 2016.

Additionally, it causes this dataset to be falsely flagged as an orphaned dataset.

To sum up the problem:

Phantom crawl attempts prevent datasets from being re-crawled and leave stale data in GBIF's index. They also cause datasets to be erroneously flagged as orphaned because they appear not to have been re-crawled for several months.
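
These stuck attempts should presumably be identifiable by their missing start date. A minimal sketch, assuming a Postgres table like the crawl_history table with a nullable started_crawling column quoted later in this thread (dataset_key is an assumed column name):

    -- Hypothetical query: list datasets that have a crawl attempt with no start date
    SELECT DISTINCT dataset_key
    FROM crawl_history
    WHERE started_crawling IS NULL;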

Screenshots related to the example above:

Registry console showing the crawl history and a crawl running since April 2016:
[screenshot: screen shot 2017-06-09 at 10 41 54]

Properties of a phantom crawl:
[screenshot: screen shot 2017-06-09 at 10 17 49]

kbraak added a commit to gbif/watchdog that referenced this issue Jun 14, 2017

kbraak commented Jun 14, 2017

A .txt file with the UUIDs of the 3466 datasets in this state can be downloaded here. Thanks @timrobertson100 for offering to clean up their state. It would be great if a re-crawl could be initiated for these datasets, although for my purpose of identifying orphans it is fine to skip re-crawling the PANGAEA and GEO-Tag der Artenvielfalt datasets. Thanks again.

jlegind commented Oct 27, 2017

Many BioCASE datasets have not been crawled for months and I suspect it is because of the phantom crawls issue. Here are a few examples from the GBIF registry:

thomasstjerne commented:

There have been a couple of examples of datasets validating properly when submitted to the validator but seemingly having indexing problems that could be related to this:

gbif/portal-feedback#590
gbif/portal-feedback#591

Also, the crawler has ignored this dataset for a while, and it is not possible to trigger a crawl from the registry or through the API:
https://www.gbif.org/dataset/2b94a042-fe01-4d9f-8995-d996c21d33cd
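
If this dataset is stuck in the same state, a phantom attempt should presumably show up in its crawl history. A minimal sketch, reusing the crawl_history table and started_crawling column quoted later in this thread (dataset_key is an assumed column name):

    -- Hypothetical check: look for a never-started crawl attempt for the dataset above
    SELECT *
    FROM crawl_history
    WHERE dataset_key = '2b94a042-fe01-4d9f-8995-d996c21d33cd'
      AND started_crawling IS NULL;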

MattBlissett commented:

I deleted the empty rows in Postgres, which has allowed the crawler scheduler to start crawls on ~7000 datasets.

No new empty rows have appeared, but I'm going to see if it happens again before closing this.

    select * from crawl_history where started_crawling is null order by started_crawling;
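
The corresponding cleanup would presumably be a delete over the same rows. A minimal sketch, assuming the same table and column (not necessarily the exact statement that was run):

    -- Hypothetical cleanup: remove the never-started (phantom) crawl rows
    DELETE FROM crawl_history
    WHERE started_crawling IS NULL;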

@gbif/content: Many hundreds of datasets have changes but haven't been ingested for (sometimes) over a year. I noticed at least one which duplicated itself. Also, it's not finished yet.

MattBlissett commented:

One case where these appear is when the crawler-clis are restarted, interrupting long-running XML crawls, and those long-running crawls are restarted. I'm not sure exactly which part caused the bad records.

The ideal method for stopping crawling is probably to stop the scheduler and coordinator, then wait until everything is finished before stopping the crawlers and fragmenters.

MattBlissett commented:

The SQL above gives no more failures 1¾ years later, so I will close this.
