
WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1044

Open
michaeldinzinger opened this issue Feb 26, 2023 · 4 comments

@michaeldinzinger
Contributor

Hello all,
As far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to e.g. an S3-compatible storage according to some RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end and does not emit any tuples.
However, we are especially interested in which file (filename) a certain web page / WARC record is written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
So that we know in the end: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Probably only if the WARCHdfsBolt emits the corresponding info to the StatusUpdaterBolt and is no longer a dead end.

@sebastian-nagel
Contributor

Hi @michaeldinzinger, this overlaps with #567, and recently I started to explore potential ways to implement a CDX indexer:

  1. The first idea was to send a tuple with the URL, metadata, WARC file name and WARC record offsets forward in the topology. This seems more elegant because it is up to the user to define which bolt consumes the WARC record location. However, it looks challenging to implement because the method execute(tuple) in AbstractHdfsBolt is final. I haven't yet explored "dirty" tricks, such as holding a reference to the collector in the writer. It seems the HdfsBolt is designed to be a dead end (although there is nothing about that in the storm-hdfs docs).
  2. The alternative would be to write the CDX file along with the WARC file. This is a viable use case of the HdfsBolt, cf. STORM-1464: Support multiple file outputs storm#1044.

Given that there is more general interest, I'd continue to explore variant 1 - but I cannot promise when or whether this will be successful. Any suggestions or help are welcome!
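For illustration, here is roughly what the output of variant 2 could look like. This is a minimal sketch of a CDXJ-style record builder, not StormCrawler code; the simplified SURT key and the offset/length values are illustrative assumptions (real CDX writers, e.g. those used by pywb, handle many more URL normalization cases):

```python
import json
from urllib.parse import urlsplit

def surt_key(url):
    # Simplified SURT-style sort key: host labels reversed, then the path.
    # Illustrative only; real SURT handles ports, schemes, query strings, etc.
    parts = urlsplit(url)
    host = parts.hostname or ""
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")

def cdxj_line(url, timestamp, warc_file, offset, length):
    # One CDXJ record: sort key, capture timestamp, JSON payload pointing
    # at the WARC record location (filename + byte offset + length).
    payload = {"url": url, "filename": warc_file,
               "offset": str(offset), "length": str(length)}
    return f"{surt_key(url)} {timestamp} {json.dumps(payload)}"

# Offset and length values here are made up for the example.
line = cdxj_line("https://stormcrawler.net/faq/", "20230226000000",
                 "WARC_file_0815.warc.gz", 1024, 2048)
```

One such line per WARC record, written next to the WARC file, is all a downstream indexer would need to resolve a URL to its record location.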

@michaeldinzinger
Contributor Author

Hello @sebastian-nagel, thank you for your answer! :) Personally, I would really appreciate this, because being aware of the WARC record location is an important (though not central) aspect of our use of StormCrawler. I would therefore also be willing to investigate this issue someday. What a pity that the HdfsBolt is constructed as a dead end...

@michaeldinzinger
Contributor Author

Another thing that came up on our end regarding this issue:
Besides the aforementioned information
https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
we would especially like to have the information
s3://path/to/file/WARC_file_0815.warc.gz --was_created_on--> Timestamp.now()
This is also not possible because
(1) the WARCHdfsBolt is a dead end, and
(2) information within the StormCrawler topology is only propagated per URL, so to speak (that's dangerous half-knowledge on my side).
Am I right about these?

The background of this question is that we want to trigger further processing of the WARC files once a WARC file is completely written. So I'm wondering whether the crawler can provide us with the info "WARC file is now ready".

@sebastian-nagel
Contributor

You could just check the filesystem for new files from time to time. This seems reasonable since WARC files usually hold several tens of thousands of records and, consequently, aren't finalized very often.
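A minimal sketch of such a polling approach, assuming a local directory (for S3 one would list keys instead) and a made-up "settle time" heuristic: a WARC file is treated as finished once it has not been modified for a while, i.e. the writer has presumably rotated to a new file.

```python
import os
import time

def poll_finished_warcs(directory, seen, settle_seconds=60):
    """Yield paths of WARC files that look complete: not yet reported
    (tracked in `seen`) and unmodified for `settle_seconds`."""
    now = time.time()
    for entry in os.scandir(directory):
        if not entry.name.endswith(".warc.gz") or entry.path in seen:
            continue
        if now - entry.stat().st_mtime >= settle_seconds:
            seen.add(entry.path)       # report each file only once
            yield entry.path
```

Calling this periodically (e.g. from a cron job) gives a "WARC file is now ready" signal without any change to the topology; the settle-time threshold should be comfortably larger than the bolt's sync interval.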
