Phantom crawls #3
A .txt file with the UUIDs of 3466 datasets in this state can be downloaded here. Thanks @timrobertson100 for offering to clean up their state. It would be great if a recrawl could be initiated for these datasets, although for my purposes (identifying orphans) it's fine to skip recrawling datasets from Pangaea and GEO-Tag der Artenvielfalt. Thanks again.
Many BioCASE datasets have not been crawled for months and I suspect it is because of the phantom crawls issue. Here are a few examples from the GBIF registry:
There have been a couple of examples of datasets validating properly when submitted to the validator but seemingly having indexing problems that could be related to this: gbif/portal-feedback#590. The crawler has also ignored this dataset for a while, and it is not possible to trigger the crawl from the registry or through the API:
I deleted the empty rows in Postgres, which has allowed the crawler scheduler to start a crawl on ~7000 datasets. No new empty rows have appeared, but I'm going to see if it happens again before closing this.

select * from crawl_history where started_crawling is null order by started_crawling;

@gbif/content: many hundreds of datasets have changed but hadn't been ingested for (sometimes) over a year. I noticed at least one which duplicated itself. Also, it's not finished yet.
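The check in that SQL (rows that exist in the crawl history but never recorded a start time) can be sketched in Python. The row structure below is hypothetical, purely for illustration; the real data lives in the `crawl_history` Postgres table.

```python
def find_phantom_crawls(rows):
    """Return crawl-history rows that never recorded a start time,
    mirroring `where started_crawling is null` from the SQL above.
    Row dicts here are a hypothetical stand-in for table rows."""
    return [r for r in rows if r.get("started_crawling") is None]

history = [
    {"attempt": 3, "started_crawling": "2016-04-01"},
    {"attempt": 4, "started_crawling": None},  # a "phantom" crawl
]
print(find_phantom_crawls(history))  # only attempt 4 is returned
```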
One case where these appear is when crawler-clis are restarted, interrupting long-running XML crawls, and those crawls are then restarted. I'm not sure exactly which part causes the bad records. The ideal method for stopping crawling is probably to stop the scheduler and coordinator, wait until everything is finished, and then stop the crawlers and fragmenters.
The SQL above gives no more failures 1¾ years later, so I will close this.
Below is a detailed description of a phantom crawl attempt that has no start date and never comes to an end:
Mammals in MZNA-VERT: project CAS was registered Mar 22, 2016 and republished on Apr 14, 2016 according to its version history. Its DwC-A endpoint is confirmed working.
GBIF automatically attempts to re-crawl each dataset every 7 days. Unfortunately, this particular dataset has only been re-crawled 4 times since it was registered on Mar 22, 2016:
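The scheduler's actual logic lives in GBIF's crawler code; the sketch below is only an assumption about its behavior, illustrating why a phantom crawl blocks the 7-day cycle: a dataset with an apparently running crawl is never considered due again.

```python
from datetime import date, timedelta

RECRAWL_INTERVAL = timedelta(days=7)  # GBIF re-crawls roughly weekly

def is_due_for_recrawl(last_finished, crawl_in_progress, today):
    """Hypothetical scheduler check: skip datasets that appear to be
    mid-crawl, so a phantom crawl blocks re-crawling indefinitely."""
    if crawl_in_progress:
        return False
    return today - last_finished >= RECRAWL_INTERVAL

# A phantom crawl "running" since April 2016 keeps the dataset
# from ever becoming due, even more than a year later.
print(is_due_for_recrawl(date(2016, 4, 7), True, date(2017, 6, 1)))   # False
print(is_due_for_recrawl(date(2016, 4, 7), False, date(2016, 4, 20)))  # True
```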
Crawl attempt #4 is a phantom crawl: it has no start date and has supposedly been running ever since the first week of April 2016.
The consequence of this problem is that GBIF failed to re-index the latest changes made to the dataset when it was republished on Apr 14, 2016.
Additionally, it causes this dataset to be flagged as a false positive orphan dataset.
To sum up the problem:
Phantom crawl attempts prevent datasets from being recrawled and leave stale data in GBIF's index. They also cause datasets to be erroneously flagged as orphaned because they fail to be re-crawled within several months.
Screenshots related to the example above:
Registry console showing crawl history and crawl running since April 2016
Properties of a phantom crawl: