
WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1044

Open
michaeldinzinger opened this issue Feb 26, 2023 · 4 comments

@michaeldinzinger
Contributor

Hello all,
As far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to e.g. an S3-compatible storage according to some RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end and does not emit any tuples.
However, we are especially interested in which file (filename) a certain web page / WARC record is written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
So that we know in the end: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Probably only if the WARCHdfsBolt emits the corresponding info to the StatusUpdaterBolt and is no longer a dead end.

@sebastian-nagel
Contributor

Hi @michaeldinzinger, this overlaps with #567, and recently I started to explore potential ways to implement a CDX indexer:

  1. The first idea was to send a tuple with the URL, metadata, WARC file name and WARC record offsets forward in the topology. This seems more elegant because it is up to the user to define which bolt consumes the WARC record location. However, it looks challenging to implement because the method execute(tuple) in AbstractHdfsBolt is final. I haven't yet explored "dirty" tricks, such as holding a reference to the collector in the writer. It seems the HdfsBolt is designed to be a dead end (although there is nothing about that in the storm-hdfs docs).
  2. The alternative would be to write the CDX file along with the WARC file. This is a viable use case of the HdfsBolt, cf. STORM-1464: Support multiple file outputs storm#1044.

Given that there is more general interest, I'd continue to explore variant 1 - but I cannot promise when or whether this will be successful. Any suggestions or help are welcome!
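For illustration, here is roughly what the output of variant 2 could look like. This is a minimal sketch of a CDXJ-style record builder, not StormCrawler code; the simplified SURT key and the offset/length values are illustrative assumptions (real CDX writers, e.g. those used by pywb, handle many more URL normalization cases):

```python
import json
from urllib.parse import urlsplit

def surt_key(url):
    # Simplified SURT-style sort key: host labels reversed, then the path.
    # Illustrative only; real SURT handles ports, schemes, query strings, etc.
    parts = urlsplit(url)
    host = parts.hostname or ""
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")

def cdxj_line(url, timestamp, warc_file, offset, length):
    # One CDXJ record: sort key, capture timestamp, JSON payload pointing
    # at the WARC record location (filename + byte offset + length).
    payload = {"url": url, "filename": warc_file,
               "offset": str(offset), "length": str(length)}
    return f"{surt_key(url)} {timestamp} {json.dumps(payload)}"

# Offset and length values here are made up for the example.
line = cdxj_line("https://stormcrawler.net/faq/", "20230226000000",
                 "WARC_file_0815.warc.gz", 1024, 2048)
```

One such line per WARC record, written next to the WARC file, is all a downstream indexer would need to resolve a URL to its record location.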

@michaeldinzinger
Contributor Author

Hello @sebastian-nagel, thank you for your answer! :) Personally, I would really appreciate this, because being aware of the WARC record location is an important (though not central) aspect of our use of StormCrawler. I would therefore also be willing to investigate this issue someday. What a pity that the HdfsBolt is constructed as a dead end...

@michaeldinzinger
Contributor Author

Another thing that came up on our end regarding this issue:
Besides the aforementioned information
https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
we would especially like to have the information
s3://path/to/file/WARC_file_0815.warc.gz --was_created_on--> Timestamp.now()
This is also not possible because
(1) the WARCHdfsBolt is a dead end, and
(2) information within the StormCrawler topology is only propagated per URL, so to speak (that's dangerous half-knowledge on my side).
Am I right about these?

The background of this question is that we want to trigger further processing of the WARC files once a WARC file is completely written. So I'm wondering whether the crawler can provide us with the info "WARC file is now ready".

@sebastian-nagel
Contributor

You could just check the filesystem for new files from time to time. This seems reasonable since WARC files usually hold several tens of thousands of records and, consequently, aren't finalized very often.
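A minimal sketch of such a polling approach, assuming a local directory (for S3 one would list keys instead) and a made-up "settle time" heuristic: a WARC file is treated as finished once it has not been modified for a while, i.e. the writer has presumably rotated to a new file.

```python
import os
import time

def poll_finished_warcs(directory, seen, settle_seconds=60):
    """Yield paths of WARC files that look complete: not yet reported
    (tracked in `seen`) and unmodified for `settle_seconds`."""
    now = time.time()
    for entry in os.scandir(directory):
        if not entry.name.endswith(".warc.gz") or entry.path in seen:
            continue
        if now - entry.stat().st_mtime >= settle_seconds:
            seen.add(entry.path)       # report each file only once
            yield entry.path
```

Calling this periodically (e.g. from a cron job) gives a "WARC file is now ready" signal without any change to the topology; the settle-time threshold should be comfortably larger than the bolt's sync interval.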
