Hello! I see a potential issue with the current interface for get that I think should be solvable. The get function, which the crate seems to be built around, only accepts a &[u8]. I'm not yet as familiar with Rust at this level as I would like to be, so forgive me if I have misunderstood, but that implies I (potentially) need to read the entire file into memory before it can be matched. Looking at the implementation of get_from_file(), this appears to be exactly what that function abstracts.
#58 notes that all tests seem to pass when the test files are truncated to only 4 KB, even though the truncated files are no longer valid. DOC/PPT/XLS seem to be the exceptions, as they are parsed rather than read as binary.
Of course, reading only 4 KB is a much easier ask, but there is no guarantee this will hold in the future, and the crate interface does not make clear that this is all that may be needed to identify a file.
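For illustration, here is a minimal sketch of how a caller could cap the amount of data read today using only the standard library. The helper name `read_prefix` is hypothetical and not part of the crate, and the 4096-byte limit in the usage note is only the figure observed in #58, not a documented guarantee.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of the crate): reads at most `limit`
/// bytes from the start of `reader` and returns them as a buffer.
/// If the input is shorter than `limit`, the whole input is returned.
pub fn read_prefix<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `take` wraps the reader so that it reports EOF after `limit` bytes.
    reader.take(limit).read_to_end(&mut buf)?;
    Ok(buf)
}
```

With something like this, a caller could open a file, call `read_prefix(file, 4096)`, and hand the resulting slice to get. It works, but the caller has to guess the limit, which is the gap described above.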
My thought is that it would be much better if we could pass an iterator or a Reader of some kind to get instead, allowing us to control the amount of data kept in memory. Even better, it seems like it may be possible (though perhaps not trivial; I cannot say with my current understanding) to attach some information such as a "max required size" to each matcher, stating that the matcher should never need more than X bytes to determine whether the file matches. Obviously this does not work for all file types (although, again, the pull request mentioned above seems to imply it works for nearly all currently supported types), but that can be abstracted behind an enum or other interface for those who want to use the optimization. Perhaps the matcher could be a trait with an associated constant whose default of 0 denotes that the whole file is required?
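As a rough sketch of that last idea, assuming nothing about how the crate is actually structured: the trait `Matcher`, the constant `MAX_REQUIRED_SIZE`, the types `Png` and `EndsWithNewline`, and the function `matches_reader` are all hypothetical names made up for this example. The only external fact used is the standard 8-byte PNG signature.

```rust
use std::io::{self, Read};

/// Hypothetical matcher trait; not the crate's actual API.
pub trait Matcher {
    /// Upper bound on the bytes needed to decide a match.
    /// The default of 0 means "the whole input is required".
    const MAX_REQUIRED_SIZE: usize = 0;

    /// Returns true if `buf` looks like this matcher's file type.
    fn matches(buf: &[u8]) -> bool;
}

/// Example matcher that only needs the 8-byte PNG signature.
pub struct Png;

impl Matcher for Png {
    const MAX_REQUIRED_SIZE: usize = 8;

    fn matches(buf: &[u8]) -> bool {
        buf.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A])
    }
}

/// Toy matcher that keeps the default constant because it has to
/// inspect the very last byte of the input.
pub struct EndsWithNewline;

impl Matcher for EndsWithNewline {
    fn matches(buf: &[u8]) -> bool {
        buf.last() == Some(&b'\n')
    }
}

/// Reads only as much of `reader` as the matcher declares it needs,
/// then runs the matcher on that buffer.
pub fn matches_reader<M: Matcher, R: Read>(mut reader: R) -> io::Result<bool> {
    let mut buf = Vec::new();
    if M::MAX_REQUIRED_SIZE == 0 {
        // No bound declared: fall back to reading everything.
        reader.read_to_end(&mut buf)?;
    } else {
        // Bound declared: stop after that many bytes.
        reader
            .take(M::MAX_REQUIRED_SIZE as u64)
            .read_to_end(&mut buf)?;
    }
    Ok(M::matches(&buf))
}
```

A call would look like `matches_reader::<Png, _>(file)`. Because the bound is an associated constant, it is known at compile time, so a caller checking several matchers could in principle read once up to the largest declared bound and reuse that buffer.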