Malformed HTTP headers lead to "ValueError: need more than 1 value to unpack" crash #19

Open
JustAnotherArchivist opened this issue Apr 27, 2019 · 1 comment
@JustAnotherArchivist

I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned:

HTTP/1.1 302 
Content-Type: text/html
nnCoection: close
Content-Length: 0
Location: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20
Cache-Control: no-cache
Pragma: no-cache

More precisely, in Python repr notation:

b'HTTP/1.1 302 \nContent-Type: text/html\nnnCoection: close\nContent-Length: 0\nLocation: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js?tag=discount039-20\nCache-Control: no-cache\nPragma: no-cache\n\n'

There are several issues with this response, but the main one, and the one causing trouble here, is that the line endings are bare LF rather than CRLF. This causes warcat verify to crash with the following traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".../lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
    main()
  File ".../lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
    command_info[1](args)
  File ".../lib/python3.4/site-packages/warcat/__main__.py", line 136, in verify_command
    tool.process()
  File ".../lib/python3.4/site-packages/warcat/tool.py", line 95, in process
    check_block_length=self.check_block_length)
  File ".../lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File ".../lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
    content_type)
  File ".../lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
    field_cls=HTTPHeader)
  File ".../lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
    fields = field_cls.parse(file_obj.read(field_length).decode())
  File ".../lib/python3.4/site-packages/warcat/model/field.py", line 215, in parse
    http_headers.status, s = s.split(newline, 1)
ValueError: need more than 1 value to unpack
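
The failing statement can be reproduced outside warcat. The sketch below assumes that the `newline` variable in field.py is the CRLF sequence, which the traceback does not show; `split_status_strict` is a hypothetical stand-in for the line at field.py:215, not warcat's actual code.

```python
# Assumption: the parser's `newline` is CRLF. The traceback does not show its value.
NEWLINE = "\r\n"

# The response bytes exactly as given in the report above (bare LF line endings).
raw = (
    b"HTTP/1.1 302 \nContent-Type: text/html\nnnCoection: close\n"
    b"Content-Length: 0\n"
    b"Location: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js"
    b"?tag=discount039-20\n"
    b"Cache-Control: no-cache\nPragma: no-cache\n\n"
)


def split_status_strict(text, newline=NEWLINE):
    """Hypothetical stand-in for the failing statement in the traceback."""
    # str.split returns a one-element list when the separator is absent,
    # so unpacking into two names raises ValueError.
    status, rest = text.split(newline, 1)
    return status, rest


failure = None
try:
    split_status_strict(raw.decode())
except ValueError as error:
    failure = str(error)

# The exact wording of the message depends on the Python version.
print(failure)
```

Because the data contains no CRLF at all, the split yields a single element and the two-name unpacking fails, which matches the last frame of the traceback.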
chfoo added the bug label on Dec 11, 2019
@catharsis71

I'm getting a similar but slightly different error:

$ python3 -m warcat extract 195.242.99.71-8181-2016-03-23-3324e7c6-00000.warc.gz --output-dir output2 --progress --keep-going
47900 | =  | Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/username/.local/lib/python3.8/site-packages/warcat/__main__.py", line 154, in <module>
    main()
  File "/home/username/.local/lib/python3.8/site-packages/warcat/__main__.py", line 70, in main
    command_info[1](args)
  File "/home/username/.local/lib/python3.8/site-packages/warcat/__main__.py", line 131, in extract_command
    tool.process()
  File "/home/username/.local/lib/python3.8/site-packages/warcat/tool.py", line 93, in process
    record, has_more = model.WARC.read_record(f,
  File "/home/username/.local/lib/python3.8/site-packages/warcat/model/warc.py", line 74, in read_record
    record = Record.load(file_object, preserve_block=preserve_block,
  File "/home/username/.local/lib/python3.8/site-packages/warcat/model/record.py", line 67, in load
    record.content_block = ContentBlock.load(file_obj, block_length,
  File "/home/username/.local/lib/python3.8/site-packages/warcat/model/block.py", line 20, in load
    return BlockWithPayload.load(file_obj, length,
  File "/home/username/.local/lib/python3.8/site-packages/warcat/model/block.py", line 94, in load
    fields = field_cls.parse(field_str)
  File "/home/username/.local/lib/python3.8/site-packages/warcat/model/field.py", line 215, in parse
    http_headers.status, s = s.split(newline, 1)
ValueError: not enough values to unpack (expected 2, got 1)

Are there any workarounds for this, or any way to remove the bad data from the file so the rest can be unpacked? I tried the --keep-going option, but it doesn't seem to help.
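
This is not a confirmed workaround and not part of warcat. It is only a sketch of how a header parser could tolerate bare LF line endings, assuming the failing record has the same problem as the response in the original report. The names `split_status_line` and `parse_header_lines` are hypothetical.

```python
import re

# Accept either CRLF or a bare LF as a line terminator.
_LINE_END = re.compile(r"\r?\n")


def split_status_line(text):
    """Split off the status line without raising when no terminator is found."""
    parts = _LINE_END.split(text, maxsplit=1)
    status = parts[0]
    rest = parts[1] if len(parts) == 2 else ""
    return status, rest


def parse_header_lines(rest):
    """Return (name, value) pairs up to the first blank line."""
    fields = []
    for line in _LINE_END.split(rest):
        if not line:
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines with no colon instead of failing
        fields.append((name.strip(), value.strip()))
    return fields


# The malformed response from the original report.
text = (
    "HTTP/1.1 302 \nContent-Type: text/html\nnnCoection: close\n"
    "Content-Length: 0\n"
    "Location: //wms.assoc-amazon.com/20070822/US/js/link-enhancer-common.js"
    "?tag=discount039-20\n"
    "Cache-Control: no-cache\nPragma: no-cache\n\n"
)

status, rest = split_status_line(text)
fields = parse_header_lines(rest)
print(repr(status))
for name, value in fields:
    print(name, "=", value)
```

The sketch keeps unknown header names such as nnCoection as they are; it only relaxes the line-ending requirement and does not try to repair the data.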
