Elasticsearch IndexerBolt: tuples with canonical URL may not get acked #832
Closed
This issue was seen in a topology fed by WARCSpout. The failure of unacked tuples triggered #825. The failure was reproducible: when the topology was run again with the same input, a mostly overlapping (but not identical) set of URLs was logged as failed. In addition, the failed URLs are missing from the status index.
A closer analysis showed that pages with a canonical URL were involved. One example:
2020-10-06 19:41:44.202 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm with status 200
2020-10-06 19:41:44.351 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm with status 200
2020-10-06 19:41:45.608 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:45.636 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm in 23 msec
2020-10-06 19:41:45.674 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm - 61 msec
2020-10-06 19:41:45.675 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.191 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm
2020-10-06 19:41:46.202 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm in 7 msec
2020-10-06 19:41:46.212 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm - 17 msec
2020-10-06 19:41:46.215 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.754 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [WARN] Could not find unacked tuple for 50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7
2020-10-06 19:51:40.108 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
2020-10-06 19:51:43.985 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [ERROR] Failed - unable to replay WARC record of: https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
The same log, reduced for readability (A is the page at the canonical URL, B is its /amp.htm variant):
19:41:44.202 [INFO] Fetched A with status 200
19:41:44.351 [INFO] Fetched B with status 200
19:41:45.608 [INFO] Parsing : starting A
19:41:45.636 [INFO] Parsed A in 23 msec
19:41:45.674 [INFO] Total for A - 61 msec
19:41:45.675 [INFO] Indexing A as A
19:41:46.191 [INFO] Parsing : starting B
19:41:46.202 [INFO] Parsed B in 7 msec
19:41:46.212 [INFO] Total for B - 17 msec
19:41:46.215 [INFO] Indexing B as A
19:41:46.754 [WARN] Could not find unacked tuple for sha256sum(A)
19:51:40.108 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
19:51:43.985 [ERROR] Failed - unable to replay WARC record of: A
Note: no Bulk response message was logged before the one shown, which means both pages/URLs were processed in the first bulk request. The hash in the warning was verified to be the SHA-256 hash of A:
$> echo -n 'https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm' \
| sha256sum
50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7 -
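The log sequence above is consistent with two tuples sharing one document ID inside the same bulk request. The following Python sketch illustrates that suspected mechanism only; it is not StormCrawler code. It assumes that the bolt tracks pending tuples in a table keyed by the SHA-256 of the URL the page is indexed as, and that a second entry with the same key replaces the first. `PendingAcks`, `index`, and `on_bulk_item` are hypothetical names introduced for this illustration.

```python
import hashlib
from dataclasses import dataclass, field


def doc_id(url: str) -> str:
    """SHA-256 hex digest of the URL, equivalent to `echo -n URL | sha256sum`."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


@dataclass
class PendingAcks:
    """Hypothetical stand-in for the table of tuples awaiting a bulk response."""
    by_id: dict = field(default_factory=dict)
    acked: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def index(self, source_url: str, indexed_as: str) -> None:
        # Assumption: the key is the ID of the URL the page is indexed *as*.
        # A second tuple with the same ID silently replaces the first one.
        self.by_id[doc_id(indexed_as)] = source_url

    def on_bulk_item(self, item_id: str) -> None:
        # Called once per item in the bulk response.
        source_url = self.by_id.pop(item_id, None)
        if source_url is None:
            self.warnings.append(f"Could not find unacked tuple for {item_id}")
        else:
            self.acked.append(source_url)


A = "https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm"
B = "https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm"

pending = PendingAcks()
pending.index(A, A)  # "Indexing A as A"
pending.index(B, A)  # "Indexing B as A" replaces the entry stored for A

# The bulk response carries one item per request, both with ID sha256(A).
for _ in range(2):
    pending.on_bulk_item(doc_id(A))

never_acked = {A, B} - set(pending.acked)
print("acked:", pending.acked)
print("never acked:", never_acked)
print("warnings:", len(pending.warnings))
```

Under these assumptions, the first response item acks the tuple for B, the second produces the "Could not find unacked tuple" warning, and the tuple for A is never acked, which would match the later failure reported by WARCSpout for A.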