
Excessive memory usage when loading a WARC with big files #5

Open
bzc6p opened this issue May 28, 2015 · 2 comments

bzc6p commented May 28, 2015

I tried to load a WARC containing a few larger (200-300 MB) files. While the WARC was being loaded (indexed), memory usage of the Python process doing the indexing grew to roughly 700 MB, and then it ran out of memory, leaving the following error message in the terminal:

Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError

The progress bar got stuck and the indexing stopped.
I suspect the big files are responsible for this, as I've been using this great tool for a long time and haven't run into such a problem before (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't imagine why warc-proxy would need 700 MB of memory to index a 250 MB file.
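
A plausible explanation, judging from the last frame of the traceback, is that feed_line slices the whole remaining buffer with str(self.buffer[pos:]). Assuming self.buffer is a bytearray-like object holding the entire ~250 MB record payload (this is an assumption, not confirmed from the code), the slice allocates a second full copy and str() a third, which would roughly account for a 700 MB peak. A minimal sketch of the effect (not warc-proxy code; the sizes and the bytearray assumption are illustrative):

    # Python 2: each step below holds its own full copy of the payload in memory
    buf = bytearray(250 * 1024 * 1024)   # the buffered record payload, ~250 MB
    tail = buf[16:]                      # slicing a bytearray copies the remainder, ~250 MB more
    text = str(tail)                     # str() produces yet another full copy, ~250 MB more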

I think you can easily reproduce the problem: the problematic WARC is available at https://archive.org/details/hajduvolan_hu_2015_05. The files that probably trigger it are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).
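
For comparison, one way to check that such records can be read without buffering each payload whole is to stream them in fixed-size chunks. A minimal sketch using the separate warcio library (not warc-proxy's own code path; the 1 MiB chunk size is arbitrary):

    import sys
    from warcio.archiveiterator import ArchiveIterator

    CHUNK = 1 << 20  # read 1 MiB at a time

    with open(sys.argv[1], 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            total = 0
            body = record.content_stream()
            while True:
                chunk = body.read(CHUNK)
                if not chunk:
                    break
                total += len(chunk)
            # peak memory stays bounded by CHUNK regardless of record size
            print(uri, total)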

@martinvahi

I haven't measured the memory usage, but I'm writing this comment here to report that it completely fails to load/index WARC files that are multiple GiB in size. An example that fails to load might be available at

http://temporary.softf1.com/2017/bugs/www.tldp.org-2017-01-06-c51e36ac-00000.warc.gz


bzc6p commented Jan 9, 2017

I generally don't have problems with WARCs up to a few gigabytes in size (I haven't tried files tens of gigabytes in size, though), only when there are files of several hundred megabytes inside the WARC itself.

I've tried your file, and it was indexed fine here. Maybe check the command-line messages while indexing to get a clue about a missing dependency or some other sort of error.
