Invalid Avro file causes very slow processing #66

Closed
movermeyer opened this issue Dec 7, 2016 · 4 comments

Comments

movermeyer commented Dec 7, 2016

I have a program that receives binary files that may or may not be Avro encoded.
My idea was to just try to decode the file using fastavro, and if that failed, treat it like a non-Avro file.

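import fastavro

try:
    with open("maybe_an_avro_file", 'rb') as fin:
        # The reader parses the header on construction, so a non-Avro file fails here.
        reader = fastavro.reader(fin)
except Exception:
    # This file cannot be parsed as Avro; handle it differently.
    pass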

However, a few files caused fastavro to spend a very long time trying to read the block count from the (non-existent) schema header.

For example, a 2 MB file kept one CPU busy for nearly an hour before eventually raising a Python OverflowError.

This is not a huge file, but something in the read path appears to be worse than O(n): the first few bytes are read quickly, then performance drops off sharply.

  File <REDACTED>
    reader = fastavro.reader(input)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 478, in __init__
    self._header = read_data(fo, HEADER_SCHEMA)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 391, in read_data
    return READERS[record_type](fo, writer_schema, reader_schema)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 334, in read_record
    record[field['name']] = read_data(fo, field['type'])
  File "<REDACTED>/fastavro/fastavro/reader.py", line 391, in read_data
    return READERS[record_type](fo, writer_schema, reader_schema)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 273, in read_map
    block_count = read_long(fo)
  File "<REDACTED>/fastavro/fastavro/reader.py", line 162, in read_long
    n |= (b & 0x7F) << shift

This results in a few questions:

  1. Is there a better way to know whether a file is an Avro formatted file, besides attempting to parse it?
  2. Is the non-linearity concerning?
tebeka added a commit that referenced this issue Dec 8, 2016
tebeka (Collaborator) commented Dec 8, 2016

  1. Is there a better way to know whether a file is an Avro formatted file, besides attempting to parse it?

Avro files have a "magic" header that you can check. I've added an is_avro function you can use; it will be released soon in 0.12.0.
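
For reference, the magic check can be sketched without fastavro. Per the Avro specification, an object container file starts with the three ASCII bytes `Obj` followed by the byte 1. The helper name `looks_like_avro` below is hypothetical and is not fastavro's `is_avro` implementation; it assumes a seekable binary file object and restores the stream position after peeking.

```python
import io

# Container-file magic from the Avro specification: 'O', 'b', 'j', then 1.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(fo):
    """Return True if the seekable binary stream starts with the Avro magic.

    Hypothetical helper for illustration only. The stream position is
    restored so the caller can still hand the stream to a real reader.
    """
    start = fo.tell()
    try:
        return fo.read(len(AVRO_MAGIC)) == AVRO_MAGIC
    finally:
        fo.seek(start)


if __name__ == "__main__":
    print(looks_like_avro(io.BytesIO(b"Obj\x01rest-of-file")))
    print(looks_like_avro(io.BytesIO(b"\xcd" * 16)))
```

A matching magic only says the file claims to be Avro; a truncated or corrupted file can still fail later during decoding, so the try/except around the reader remains useful.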

  2. Is the non-linearity concerning?

Yes, will investigate more.

movermeyer (Author) commented Dec 8, 2016

Thanks for your speedy response.

One of the files that took a long time was filled entirely with 0xCD repeated for 2 MiB. In that case, it was a bug in an upstream system that produced it, but it might be a handy test file.
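
A possible explanation for the slowdown on such a file, offered as a sketch and not a confirmed diagnosis: every 0xCD byte has its high bit set, which in Avro's variable-length long encoding means "more bytes follow". A decoder loop like the one in the traceback (`n |= (b & 0x7F) << shift`) would then keep accumulating into an ever-larger Python integer until the input runs out, and each step costs time proportional to the size of that integer, giving roughly quadratic work overall. One way to avoid this is to cap the number of bytes a single long may occupy. The function and constant names below are hypothetical, not fastavro's code, and the cap assumes a 64-bit long never needs more than 10 bytes.

```python
import io

# Assumption: a 64-bit value needs at most ceil(64 / 7) = 10 varint bytes.
MAX_VARINT_BYTES = 10


def read_long_bounded(fo):
    """Decode one zigzag-encoded varint, refusing over-long encodings.

    Hypothetical illustration of a bounded decoder.
    """
    n = 0
    for i in range(MAX_VARINT_BYTES):
        chunk = fo.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        # Low 7 bits carry data; the high bit flags a continuation byte.
        n |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            # Undo zigzag encoding to recover the signed value.
            return (n >> 1) ^ -(n & 1)
    raise ValueError("varint longer than 10 bytes; input is probably not Avro")


if __name__ == "__main__":
    print(read_long_bounded(io.BytesIO(b"\x80\x01")))
    try:
        read_long_bounded(io.BytesIO(b"\xcd" * 2048))
    except ValueError as exc:
        print("rejected:", exc)
```

With a cap like this, an all-0xCD input is rejected after reading 10 bytes instead of being consumed to the end of the file.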

movermeyer (Author) commented Dec 8, 2016

Further, could we change is_avro to accept a file object instead of a file path?

That way, I could pass in StringIO objects and get the answer without having to write the data to a temp file first.

tebeka (Collaborator) commented Dec 9, 2016

Done in 0.12.1; you can now pass either a path or a file-like object.
I'll test with the 0xCD file.

@movermeyer movermeyer closed this Jul 27, 2017