Invalid Avro file causes very slow processing #66

Closed
movermeyer opened this Issue Dec 7, 2016 · 4 comments


movermeyer commented Dec 7, 2016

I have a program that receives binary files that may or may not be Avro encoded.
My idea was to just try to decode the file using fastavro, and if that failed, treat it like a non-Avro file.

import fastavro

try:
    with open("maybe_an_avro_file", 'rb') as fin:
        reader = fastavro.reader(fin)
except Exception:
    # This file cannot be parsed as Avro; handle it differently
    pass

However, there were a few files that caused fastavro to take a very long time trying to read the block count out of the (non-existent) schema header.

For example, a 2M file consumed a CPU for nearly 1 hour before eventually causing a Python OverflowError.

This is not a huge file, but something appears to scale worse than O(n): the first few bytes are read quickly, and then performance drops off sharply.

  File <REDACTED>
    reader = fastavro.reader(input)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 478, in __init__
    self._header = read_data(fo, HEADER_SCHEMA)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 391, in read_data
    return READERS[record_type](fo, writer_schema, reader_schema)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 334, in read_record
    record[field['name']] = read_data(fo, field['type'])
  File "<REDACTED>/fastavro/fastavro/reader.py", line 391, in read_data
    return READERS[record_type](fo, writer_schema, reader_schema)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 273, in read_map
    block_count = read_long(fo)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 162, in read_long
    n |= (b & 0x7F) << shift

This results in a few questions:

  1. Is there a better way to know whether a file is an Avro formatted file, besides attempting to parse it?
  2. Is the non-linearity concerning?

tebeka added a commit that referenced this issue Dec 8, 2016

Collaborator

tebeka commented Dec 8, 2016

  1. Is there a better way to know whether a file is an Avro formatted file, besides attempting to parse it?

Avro files have a "magic" header which you can check. I've added an is_avro function you can use; it will be released soon in 0.12.0.
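
As an illustration of the "magic" header check mentioned above: the Avro specification says an object container file begins with the four bytes `Obj` followed by the byte 1. The sketch below tests for that prefix. The names `AVRO_MAGIC` and `has_avro_magic` are hypothetical and are not taken from fastavro; this is not necessarily how is_avro is implemented.

```python
# Minimal sketch, assuming only the 4-byte magic from the Avro specification.
# AVRO_MAGIC and has_avro_magic are illustrative names, not fastavro APIs.
AVRO_MAGIC = b"Obj\x01"


def has_avro_magic(header: bytes) -> bool:
    """Return True if the given bytes start with the Avro container magic."""
    return header.startswith(AVRO_MAGIC)


# Only the first four bytes of the file are needed for the check.
sample_header = b"Obj\x01" + b"...rest of the file..."
not_avro = b"\xcd" * 16

print(has_avro_magic(sample_header[:4]))
print(has_avro_magic(not_avro[:4]))
```

A matching prefix only shows that the file claims to be Avro; a truncated or corrupt file can still fail later during parsing, so the try/except around the reader remains useful.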

  2. Is the non-linearity concerning?

Yes, will investigate more.

movermeyer commented Dec 8, 2016

Thanks for your speedy response.

One of the files that took a long time was filled entirely with 0xCD repeated for 2 MiB. In that case, it was a bug in an upstream system that produced it, but it might be a handy test file.
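
A plausible reading of why the 0xCD file is so slow: Avro longs are variable-length integers in which the high bit of each byte means "another byte follows". 0xCD has its high bit set, so a decoder with no length limit treats the whole file as a single number. In Python, each `n |= (b & 0x7F) << shift` then operates on an ever-larger integer, so the total work grows roughly quadratically with the number of bytes consumed. The sketch below is a hypothetical bounded decoder, not fastavro's code; `read_long_bounded` and `MAX_VARINT_BYTES` are names made up for illustration. It rejects any varint longer than 10 bytes, which is the most a 64-bit value can need.

```python
import io

# Assumption: values are 64-bit longs, so at most 10 seven-bit groups.
MAX_VARINT_BYTES = 10


def read_long_bounded(fo):
    """Decode one zigzag-encoded varint, reading at most 10 bytes."""
    n = 0
    shift = 0
    for _ in range(MAX_VARINT_BYTES):
        c = fo.read(1)
        if not c:
            raise EOFError("stream ended inside a varint")
        b = c[0]
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            # High bit clear: this was the last byte of the value.
            return (n >> 1) ^ -(n & 1)  # undo zigzag encoding
        shift += 7
    raise ValueError("varint longer than 10 bytes; probably not Avro data")


# A stream of 0xCD bytes is rejected after 10 bytes instead of being
# accumulated into one enormous integer.
try:
    read_long_bounded(io.BytesIO(b"\xcd" * (2 * 1024 * 1024)))
except ValueError as exc:
    print("rejected:", exc)
```

With a bound like this, the cost of rejecting a garbage file is constant rather than dependent on the file size.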

movermeyer commented Dec 8, 2016

Further, could we change is_avro to take in a file object instead of a filepath?

That way, I could pass in StringIO objects instead and get the answer, instead of having to write it to a temp file first.

tebeka commented Dec 9, 2016

Done in 0.12.1; you can now pass either a path or a file-like object.
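
For readers curious what "either a path or a file-like object" can look like in practice, here is a self-contained sketch of that dispatch. `looks_like_avro` is a hypothetical stand-in written for this example, not fastavro's is_avro, and it assumes any file-like object passed in is a seekable binary stream. On Python 3, an in-memory binary buffer is `io.BytesIO` rather than StringIO.

```python
import io

AVRO_MAGIC = b"Obj\x01"  # container-file magic from the Avro specification


def looks_like_avro(path_or_buffer):
    """Check the Avro magic on a filesystem path or a seekable binary stream."""
    if hasattr(path_or_buffer, "read"):
        # File-like object: peek at the header, then restore the position
        # so the caller can still parse the stream from where it was.
        start = path_or_buffer.tell()
        try:
            head = path_or_buffer.read(len(AVRO_MAGIC))
        finally:
            path_or_buffer.seek(start)
        return head == AVRO_MAGIC
    # Otherwise treat the argument as a path.
    with open(path_or_buffer, "rb") as fin:
        return fin.read(len(AVRO_MAGIC)) == AVRO_MAGIC


in_memory = io.BytesIO(b"\xcd" * 64)
print(looks_like_avro(in_memory))
```

Restoring the stream position is a design choice for this sketch: it lets the same buffer be handed to a reader afterwards without an explicit seek.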
I'll check with the 0xCD file.

@movermeyer movermeyer closed this Jul 27, 2017
