WARCHdfsBolt writes zero byte files #596

Closed
keyboardsamurai opened this issue Aug 3, 2018 · 3 comments

keyboardsamurai commented Aug 3, 2018

As discussed in a Stack Overflow thread, I tried to create a storm-crawler-archetype 1.10 based project that emits WARC files. Unfortunately, these WARC files always come out empty, containing 0 bytes.

I created a repo that shows the setup. I also tried to apply the suggestions @jnioche gave me (adding a FileTimeSizeRotationPolicy set to rotate every 10 seconds / 10 KB and setting a new CountSyncPolicy(1)), to no avail; a sketch of that wiring is shown below.
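
Concretely, the policies were wired onto the bolt roughly as follows (a sketch only: the warcbolt variable, the imports, and the exact enum names used by the WARC module's FileTimeSizeRotationPolicy are assumed here, not copied from the project):

        // rotate the open WARC file every 10 seconds, or once it reaches 10 KB
        FileTimeSizeRotationPolicy rotpol = new FileTimeSizeRotationPolicy(10.0f, Units.KB);
        rotpol.setTimeRotationInterval(10, TimeUnit.SECONDS);
        warcbolt.withRotationPolicy(rotpol);
        // ask the bolt to sync/flush after every single tuple
        warcbolt.withSyncPolicy(new CountSyncPolicy(1));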

The command I use to run this example is:

        mvn clean package && storm jar target/test-1.0-SNAPSHOT.jar test.CrawlTopology -conf ./crawler-conf.yaml -local

Sidenote:
I also downgraded the test repo above to SC 1.8 and Storm 1.2.1 in this branch, but couldn't get it to write a proper WARC file either when using time-based rotation; those files were 4 KB in size but appeared to be invalid gzip files with some binary content.

However, when I either lowered the file size threshold to an extremely low value like 4 KB or used the MemoryStatusUpdater for recursive crawling, valid single-page archives started to appear. It seems the flushing behaviour is still somewhat unpredictable.

@jnioche jnioche added this to the 1.11 milestone Aug 5, 2018

jnioche commented Sep 4, 2018

Hi @keyboardsamurai, to get the sync working you need to configure HDFS like so:

        warcbolt.withConfigKey("warc");
        Map<String, Object> hdfsConf = new HashMap<>();
        hdfsConf.put("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        getConf().put("warc", hdfsConf);

This uses RawLocalFileSystem which, unlike the checksum-based filesystem used by default, does a proper sync of the content to the file.
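
For context, here is a minimal sketch of where that snippet would sit in the topology class, assuming the archetype's setup and the WARC module's WARCHdfsBolt / WARCFileNameFormat (the path, component names and builder variable are illustrative placeholders, not taken from this thread):

        // build the WARC bolt and point its HDFS config key at the map above
        Map<String, Object> hdfsConf = new HashMap<>();
        hdfsConf.put("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        getConf().put("warc", hdfsConf);

        WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt()
                .withFileNameFormat(new WARCFileNameFormat().withPath("/warc"));
        warcbolt.withConfigKey("warc");                  // reads the "warc" map from the conf
        warcbolt.withSyncPolicy(new CountSyncPolicy(1)); // sync after each tuple

        builder.setBolt("warc", warcbolt).localOrShuffleGrouping("fetch");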

This seems to work with SC 1.8. The latest version of SC is broken and does not generate a proper gzip.

> However, when either lowering the filesize threshold to an excessive value like 4 kbyte or using the MemoryStatusUpdater for recursion, valid single page archives started to appear.

This worked because the rotation had time to work as new URLs were coming through and / or the size was low enough.


jnioche commented Sep 7, 2018

@sebastian-nagel have you tried the WARC module since the changes you made in 0afe3ed#diff-5332acd41a61ec17dd64f203a6132c33 ?

@jnioche jnioche closed this as completed in 5739c21 Sep 7, 2018

jnioche commented Sep 7, 2018

I have found the cause of the problem and fixed it. It had to do with the compression of the entries; we should now get a valid gzip regardless of whether the write is triggered by a sync or a rotation.

Thanks @keyboardsamurai for reporting it. Please give the fix a try if you can.
