Phantom crawls #3
A .txt file with the UUIDs of 3466 datasets in this state can be downloaded here. Thanks @timrobertson100 for offering to clean up their state. It would be great if a recrawl could be initiated for these datasets, although for my purposes (identifying orphans) it's fine to skip recrawling datasets from Pangaea and GEO-Tag der Artenvielfalt. Thanks again.
Many BioCASE datasets have not been crawled for months and I suspect it is because of the phantom crawls issue. Here are a few examples from the GBIF registry:
There have been a couple of examples of datasets validating properly when submitted to the validator but seemingly having indexing problems that could be related to this: gbif/portal-feedback#590. The crawler has also ignored this dataset for a while, and it is not possible to trigger the crawl from the registry or through the API:
I deleted the empty rows in Postgres, which has allowed the crawler scheduler to start a crawl on ~7000 datasets. No new empty rows have appeared, but I'm going to see if it happens again before closing this.

select * from crawl_history where started_crawling is null order by started_crawling;

@gbif/content: many hundreds of datasets have changed but hadn't been ingested for (sometimes) over a year. I noticed at least one which duplicated itself. Also, it's not finished yet.
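The check in that SQL (rows that exist in the crawl history but never recorded a start time) can be sketched in Python. The row structure below is hypothetical, purely for illustration; the real data lives in the `crawl_history` Postgres table.

```python
def find_phantom_crawls(rows):
    """Return crawl-history rows that never recorded a start time,
    mirroring `where started_crawling is null` from the SQL above.
    Row dicts here are a hypothetical stand-in for table rows."""
    return [r for r in rows if r.get("started_crawling") is None]

history = [
    {"attempt": 3, "started_crawling": "2016-04-01"},
    {"attempt": 4, "started_crawling": None},  # a "phantom" crawl
]
print(find_phantom_crawls(history))  # only attempt 4 is returned
```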
One case where these appear is when crawler-clis are restarted, interrupting long-running XML crawls, and those crawls are then restarted. I'm not sure exactly which part causes the bad records. The ideal method for stopping crawling is probably to stop the scheduler and coordinator, wait until everything is finished, and then stop the crawlers and fragmenters.
The SQL above gives no more failures 1¾ years later, so I will close this.
Below is a detailed description of a phantom crawl attempt that has no start date and never comes to an end:
Mammals in MZNA-VERT: project CAS was registered Mar 22, 2016 and republished on Apr 14, 2016 according to its version history. Its DwC-A endpoint is confirmed working.
GBIF automatically attempts to re-crawl each dataset every 7 days. Unfortunately, this particular dataset has only been re-crawled 4 times since it was registered on Mar 22, 2016:
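The scheduler's actual logic lives in GBIF's crawler code; the sketch below is only an assumption about its behavior, illustrating why a phantom crawl blocks the 7-day cycle: a dataset with an apparently running crawl is never considered due again.

```python
from datetime import date, timedelta

RECRAWL_INTERVAL = timedelta(days=7)  # GBIF re-crawls roughly weekly

def is_due_for_recrawl(last_finished, crawl_in_progress, today):
    """Hypothetical scheduler check: skip datasets that appear to be
    mid-crawl, so a phantom crawl blocks re-crawling indefinitely."""
    if crawl_in_progress:
        return False
    return today - last_finished >= RECRAWL_INTERVAL

# A phantom crawl "running" since April 2016 keeps the dataset
# from ever becoming due, even more than a year later.
print(is_due_for_recrawl(date(2016, 4, 7), True, date(2017, 6, 1)))   # False
print(is_due_for_recrawl(date(2016, 4, 7), False, date(2016, 4, 20)))  # True
```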
Crawl attempt #4 is a phantom crawl: it has no start date and has supposedly been running ever since the first week of April 2016.
The consequence of this problem is that GBIF failed to re-index the latest changes made to the dataset when it was republished on Apr 14, 2016.
Additionally, it causes this dataset to be flagged as a false positive orphan dataset.
To sum up the problem:
Phantom crawl attempts prevent datasets from being recrawled and leave stale data in GBIF's index. They also cause datasets to be erroneously flagged as orphaned because they fail to be re-crawled within several months.
Screenshots related to the example above:
Registry console showing crawl history and crawl running since April 2016
Properties of a phantom crawl: