As discussed in a Stack Overflow thread, I tried to create a storm-crawler-archetype 1.10 based project that emits WARC files. Unfortunately, these WARC files are always empty (0 bytes).
I created a repo where the setup is shown. I also tried to apply the suggestions that @jnioche gave me (adding a FileTimeSizeRotationPolicy set to rotate every 10 seconds and 10 KB, and setting a new CountSyncPolicy(1)), to no avail.
The command I use to run this example is
mvn clean package && storm jar target/test-1.0-SNAPSHOT.jar test.CrawlTopology -conf ./crawler-conf.yaml -local
Sidenote:
I also downgraded the test case from the repo above to SC 1.8 and Storm 1.2.1 in this branch, but it did not write a proper WARC file with time-based rotation either. The files were 4 KB in size but appeared to be invalid gzip files with some binary content.
However, when either lowering the file size threshold to a very low value such as 4 KB or using the MemoryStatusUpdater for recursion, valid single-page archives started to appear. It seems that the flushing behavior might still be somewhat random.
This uses the RawLocalFileSystem, which unlike the checksum one used by default does a proper sync of the content to the file.
This seems to work with SC 1.8. The latest version of SC is broken and does not generate a proper gzip.
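For readers who want to try the RawLocalFileSystem approach mentioned above, here is a minimal sketch of the Hadoop setting involved. It only builds the configuration map; `fs.file.impl` and `org.apache.hadoop.fs.RawLocalFileSystem` are Hadoop names, while the class name `WarcFsConfig` and the config key `"warc"` are hypothetical choices made for illustration. It assumes the WARC bolt is then pointed at that key (for example through the `withConfigKey` method that storm-hdfs bolts expose), which may differ from what the comment above originally linked to.

```java
import java.util.HashMap;
import java.util.Map;

final class WarcFsConfig {

    /** Hypothetical key under which the Hadoop settings are stored in the topology config. */
    static final String CONFIG_KEY = "warc";

    /** Hadoop settings that swap the default checksummed local FS for the raw one. */
    static Map<String, Object> rawLocalFsSettings() {
        Map<String, Object> hdfsConf = new HashMap<>();
        // The default local file system is a checksum file system;
        // RawLocalFileSystem writes directly to the underlying file.
        hdfsConf.put("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        return hdfsConf;
    }

    /** Stores the settings under CONFIG_KEY in an existing topology configuration map. */
    static void addTo(Map<String, Object> topologyConf) {
        topologyConf.put(CONFIG_KEY, rawLocalFsSettings());
    }

    public static void main(String[] args) {
        Map<String, Object> topologyConf = new HashMap<>();
        addTo(topologyConf);
        System.out.println(topologyConf);
    }
}
```

In a topology, `addTo` would be called on the configuration object before the WARC bolt is declared, and the bolt would be told to read its settings from the same key.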
> However, when either lowering the filesize threshold to an excessive value like 4 kbyte or using the MemoryStatusUpdater for recursion, valid single page archives started to appear.
This worked because the rotation had time to kick in as new URLs were coming through, and/or because the size threshold was low enough.
I have found the cause of the problem and fixed it. It had to do with the compression of the entries. We should now get a valid gzip file regardless of whether the write is triggered by a sync or by a rotation.
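To illustrate why per-entry compression matters here, the following is a minimal sketch using only the JDK, not the actual StormCrawler fix. It assumes each record is written as its own complete gzip member (the usual convention for compressed WARC files), so the file is a valid gzip stream after every record, no matter when it is synced or rotated. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class PerRecordGzip {

    /** Compresses one record into a complete, self-contained gzip member. */
    static byte[] gzipMember(String record) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
        } // closing the stream writes the gzip trailer (CRC-32 and length)
        return buffer.toByteArray();
    }

    /** Decompresses data made of one or more concatenated gzip members. */
    static String gunzipAll(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the output file: each record is appended as a finished member.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzipMember("record 1\n"));
        file.write(gzipMember("record 2\n"));
        System.out.print(gunzipAll(file.toByteArray()));
    }
}
```

By contrast, if a single gzip stream spans many records, a file cut off before the stream is closed lacks its trailer and fails to decompress, which matches the invalid 4 KB files described earlier in the thread.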
Thanks @keyboardsamurai for reporting it. Please give the fix a try if you can.