Delay in updates to status index #340
Comments
The URLs fetched twice have in common that they are fetched shortly before ES is queried for new URLs:
Thanks @sebastian-nagel, this explains why the min.delay.queries param does not work for all URLs.
that was actually incorrect (and reassuring)
Couldn't it be that the document from El Watan is fetched at 11:37:25.520, acked at 11:37:25.892, and 80 ms later at 11:37:25.972 spout 6 asks for new URLs and gets 90 back at 11:37:25.985? Just one possibility, and 80 ms is quite short. Once the URL is queued in the Fetcher, it will be fetched again, right? But this may happen later. Attached the full log: worker.log.zip
yes, we are saying the same thing. The min.delay.queries param doesn't help with URLs at the end of the buffer, as the time is counted from when the previous request to ES was completed.
yes, once it's in, it's in! The same applies to URLs that sit in the Fetcher internal queues for too long and get failed() because of a timeout. The spout releases them, and chances are they get returned again by the ES query and go straight back to the Fetcher queue.
…after acked:
- acked elements are kept in a cache with configurable size and duration
- fix for apache#340 to avoid URLs being fetched a second time
First patch which keeps the items longer to avoid duplicates
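The idea behind the patch can be sketched as a small bounded, time-limited cache of acked URLs, which the spout checks before queuing URLs returned by ES. This is a minimal illustration only, not StormCrawler's actual code; the class name `AckedCache` and its API are hypothetical, and real implementations often use a library cache (e.g. Guava's `CacheBuilder` with `maximumSize` and `expireAfterWrite`) instead:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: remember recently acked URLs for a configurable
// size and duration, so URLs still visible in stale ES search results
// can be skipped instead of being fetched a second time.
class AckedCache {
    private final int maxSize;
    private final long ttlMillis;
    // insertion-ordered map: URL -> timestamp at which it was acked
    private final LinkedHashMap<String, Long> entries = new LinkedHashMap<>();

    AckedCache(int maxSize, long ttlMillis) {
        this.maxSize = maxSize;
        this.ttlMillis = ttlMillis;
    }

    synchronized void add(String url, long now) {
        evictExpired(now);
        entries.put(url, now);
        // enforce the configurable size bound by dropping the oldest entries
        Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
        while (entries.size() > maxSize && it.hasNext()) {
            it.next();
            it.remove();
        }
    }

    // true if the URL was acked within the TTL: the spout should then
    // skip it when the next ES query returns it again
    synchronized boolean contains(String url, long now) {
        evictExpired(now);
        return entries.containsKey(url);
    }

    private void evictExpired(long now) {
        entries.entrySet().removeIf(e -> now - e.getValue() > ttlMillis);
    }
}
```

The TTL should be at least as long as the worst-case delay between an ack and the index refresh becoming visible in searches (over a minute in Seb's measurements above).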
... and this patch sometimes causes the worker to fail while placing items in the deletion queue/cache:
Let's first put things in an abstract class (#347), then look at this delay cache business.
Sebastian observed that his WARC files contained between 10 and 15% duplicates when crawling with ES. The most likely explanation is that there is a delay between the moment a tuple is acked and the moment the corresponding update is committed to the ES index and reflected in search results (the ack happens a lot sooner). This issue is more likely on small crawls where the diversity of URLs is low.
One way around it is to set es.status.min.delay.queries (the minimal amount of time allowed between two queries to ES) to a larger value than the default of 2 secs. By setting it to 60 secs, Seb saw a drop to 0.5-1% duplicates. This means that it can take more than a minute between the moment a tuple is acked and when the update is reflected in the search results.

Idea: we could remove the tuples from the beingProcessed hash after some additional time; this way we wouldn't refetch the same URL too soon. The various ES spouts have an overlap in code, so we could create an abstract class and share the functionality there.
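The workaround described above amounts to a one-line change in the crawler configuration. A sketch, assuming the usual YAML conf file and taking the units (seconds) from the discussion above:

```yaml
# crawler-conf.yaml (excerpt)
# Minimum delay between two consecutive queries to the ES status index.
# The discussion above reports that raising it from the default of 2
# to 60 reduced duplicates from 10-15% to 0.5-1%.
es.status.min.delay.queries: 60
```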