WARC splitting feature #26

Closed
ivbeg opened this issue Jan 31, 2014 · 2 comments

Comments


ivbeg commented Jan 31, 2014

Some cloud storage services limit file size. For example, Rackspace Cloud Files does not allow files larger than 3 GB.
It would be great to have a WARC splitting feature as part of the website crawler.

chfoo added a commit that referenced this issue Mar 12, 2014
util.py: Adds truncate_file().

Re: #26
@chfoo chfoo self-assigned this Mar 13, 2014
Member

chfoo commented Mar 13, 2014

--warc-max-size NUM is implemented in v0.25, but the WARC files can still end up larger than NUM bytes, especially when the downloaded files are large. Using WARC record segmentation (described in section 7) would fix this.
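
To illustrate why a size limit that is only checked between records can be exceeded, here is a minimal sketch. It is not Wpull's actual code: the `RollingWriter` class and its methods are hypothetical, and in-memory buffers stand in for WARC files. It assumes records are never split and that the writer moves to a new file only after a complete record has been written.

```python
import io


class RollingWriter:
    """Hypothetical writer that rolls over to a new file after a size threshold.

    Illustration only; this is not Wpull's implementation.
    """

    def __init__(self, max_size):
        self.max_size = max_size
        # In-memory buffers stand in for WARC files on disk.
        self.files = [io.BytesIO()]

    def write_record(self, record):
        # A record is written whole to the current file, never split.
        current = self.files[-1]
        current.write(record)
        # The threshold is checked only after the record is written,
        # so one large record can push the file far past max_size.
        if current.tell() >= self.max_size:
            self.files.append(io.BytesIO())

    def sizes(self):
        return [f.tell() for f in self.files]


writer = RollingWriter(max_size=100)
for length in (40, 40, 500, 10):
    writer.write_record(b"x" * length)
print(writer.sizes())  # [580, 10]
```

With a threshold of 100 bytes, the first file ends up at 580 bytes because the 500-byte record is written in full before the size check runs. Record segmentation would instead split such a record across files.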

@chfoo chfoo removed their assignment Mar 15, 2014
Member

chfoo commented Apr 8, 2014

I don't think WARC record segmentation would be helpful. Wpull relies on temporary files, and if a temporary file is too large, Wpull cannot construct the next WARC record.

--warc-max-size can still be used, but managing files on the filesystem is outside the scope of Wpull. Feel free to reopen this issue if it is still a problem.

@chfoo chfoo closed this as completed Apr 8, 2014