Elasticsearch IndexerBolt: tuples with canonical URL may not get acked #832

@sebastian-nagel

This issue was seen in a topology fed by WARCSpout. The failure of unacked tuples triggered #825. The failure was reproducible: when the topology was run again with the same input, a mostly overlapping (but not identical) set of URLs was logged as failed. In addition, the failed URLs are missing from the status index.

A closer analysis showed that pages with a canonical URL were involved. One example:

2020-10-06 19:41:44.202 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm with status 200
2020-10-06 19:41:44.351 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [INFO] Fetched https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm with status 200
2020-10-06 19:41:45.608 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:45.636 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm in 23 msec
2020-10-06 19:41:45.674 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm - 61 msec
2020-10-06 19:41:45.675 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.191 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsing : starting https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm
2020-10-06 19:41:46.202 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Parsed https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm in 7 msec
2020-10-06 19:41:46.212 c.d.s.b.JSoupParserBolt Thread-14-parse-executor[6 6] [INFO] Total for https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm - 17 msec
2020-10-06 19:41:46.215 c.d.s.e.b.IndexerBolt Thread-10-index-executor[3 3] [INFO] Indexing https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm as https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm
2020-10-06 19:41:46.754 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [WARN] Could not find unacked tuple for 50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7
2020-10-06 19:51:40.108 c.d.s.e.b.IndexerBolt I/O dispatcher 12 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
2020-10-06 19:51:43.985 c.d.s.w.WARCSpout Thread-4-spout-executor[8 8] [ERROR] Failed - unable to replay WARC record of: https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm

The same log, reduced for readability (A is the `.htm` page, B is its `/amp.htm` variant, which is indexed under A's URL):

19:41:44.202 [INFO] Fetched A with status 200
19:41:44.351 [INFO] Fetched B with status 200
19:41:45.608 [INFO] Parsing : starting A
19:41:45.636 [INFO] Parsed A in 23 msec
19:41:45.674 [INFO] Total for A - 61 msec
19:41:45.675 [INFO] Indexing A as A
19:41:46.191 [INFO] Parsing : starting B
19:41:46.202 [INFO] Parsed B in 7 msec
19:41:46.212 [INFO] Total for B - 17 msec
19:41:46.215 [INFO] Indexing B as A
19:41:46.754 [WARN] Could not find unacked tuple for sha256sum(A)
19:51:40.108 [INFO] Bulk response [246] : items 100, waitAck 42, acked 99, failed 0
19:51:43.985 [ERROR] Failed - unable to replay WARC record of: A

Note: there is no prior "Bulk response" log message, so both pages/URLs must have been processed in the first bulk. The hash in the warning was verified to be the SHA-256 hash of A:

$> echo -n 'https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm' \
    | sha256sum 
50571c0ffec7d295bb754b4847bdf2edace07885895ca09e5d459eeddd03c6f7  -
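
One possible reading of the log, sketched below in Python rather than taken from the IndexerBolt source: if pending tuples are tracked in a map keyed by the document ID (the SHA-256 of the URL a page is indexed as), then "Indexing B as A" would overwrite the entry for A's tuple. The names `wait_ack`, `index` and `on_bulk_item` are hypothetical and only illustrate the suspected collision; the actual bookkeeping in the bolt may differ.

```python
import hashlib

def doc_id(url: str) -> str:
    """SHA-256 hex digest of a URL, same computation as the sha256sum check above."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

A = "https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya.htm"
B = "https://www.obozrevatel.com/ukr/dnipro/city/u-dnipri-ta-oblasti-ogolosili-shtormove-poperedzhennya/amp.htm"

# Hypothetical stand-in for the bolt's bookkeeping: document ID -> pending tuple.
wait_ack = {}
acked = []
warnings = []

def index(tuple_url: str, indexed_as: str) -> None:
    # Assumption: the key is derived from the URL the page is indexed as,
    # so a second tuple with the same canonical URL replaces the first.
    wait_ack[doc_id(indexed_as)] = tuple_url

def on_bulk_item(item_id: str) -> None:
    # One call per item in the bulk response.
    pending = wait_ack.pop(item_id, None)
    if pending is None:
        warnings.append(f"Could not find unacked tuple for {item_id}")
    else:
        acked.append(pending)

index(A, A)  # "Indexing A as A"
index(B, A)  # "Indexing B as A": same key, A's tuple is dropped from the map

# The bulk request carried two items with the same ID.
on_bulk_item(doc_id(A))
on_bulk_item(doc_id(A))

print("acked:", acked)
print("warnings:", len(warnings))
print("A acked:", A in acked)
```

Under these assumptions only B's tuple is acked, the second response item produces the warning, and A's tuple is never acked, so it would eventually time out and be reported as failed by the spout. That would be consistent with the "items 100 ... acked 99" counts in the bulk response line, although the log alone does not prove this is the mechanism.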
