"Product" formats/natures (formats that are both X and Y) #103

Open
julik opened this issue Apr 17, 2018 · 0 comments

julik commented Apr 17, 2018

We have formats that parse ambiguously. For example, a Keynote document is a JPEG "at the head" and a ZIP with a specific structure "at the tail". A CR2 is a TIFF until considered otherwise. A TIFF is somewhat CR2-ish until considered otherwise. An Office document is a ZIP initially...

The number of these is only ever going to increase (see the library grounding principles). Currently we are at the stage where we litter the code with workarounds like "if this is also a CR2, bail out", "if this is also a ZIP, it is a Keynote file so bail out..." and so forth. What if, instead of doing this, we were to do the following:

  • Apply all the low-level parsers, always
  • Apply some "folder" or "matcher" strategy to the flat list of results. For example, if something is matched as both a JPEG and a ZIP and has a specific file structure, we can assume it is a Keynote file. We then merge the two results into one that states the Keynote file type unambiguously. Likewise, if we see the Office ZIP filenames in the file, we convert the ZIP result into a Word file result.
  • Return the folded list to the caller.

So the procedure would look somewhat like this:

initial_results = parsers.map { |p| p.call(io) } # => [JPEG, ZIP]
results_with_complex_types = fold_complex_filetypes(initial_results) # => [Keynote]
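
To make the folding step more concrete, here is a rough sketch of what fold_complex_filetypes could look like. Everything in it is an assumption for illustration: the Result struct, the FOLD_RULES table, the format symbols and the marker filenames are hypothetical and are not the library's actual result objects or verified file signatures.

```ruby
# Hypothetical result type; the real parser results may look different.
Result = Struct.new(:format, :filenames, keyword_init: true)

# Hypothetical rule table. Each rule names the low-level formats it consumes,
# a predicate over the matched results, and the format it folds them into.
# The marker filenames are placeholders, not verified signatures.
FOLD_RULES = [
  {
    consumes: [:jpg, :zip],
    test: ->(matched) { matched[:zip].filenames.to_a.include?("index.apxl") },
    produces: :key
  },
  {
    consumes: [:zip],
    test: ->(matched) { matched[:zip].filenames.to_a.include?("word/document.xml") },
    produces: :docx
  }
].freeze

def fold_complex_filetypes(results)
  # Index the non-nil results by format (keeps one result per format,
  # which is a simplification for this sketch).
  by_format = results.compact.each_with_object({}) { |r, h| h[r.format] = r }

  FOLD_RULES.each do |rule|
    # Skip the rule unless every format it consumes was detected.
    next unless rule[:consumes].all? { |f| by_format.key?(f) }
    matched = by_format.slice(*rule[:consumes])
    next unless rule[:test].call(matched)

    # Replace the consumed low-level results with a single folded result.
    rule[:consumes].each { |f| by_format.delete(f) }
    folded = Result.new(
      format: rule[:produces],
      filenames: matched.values.flat_map { |r| r.filenames.to_a }
    )
    by_format[folded.format] = folded
  end

  by_format.values
end

jpeg = Result.new(format: :jpg)
zip  = Result.new(format: :zip, filenames: ["index.apxl"])
fold_complex_filetypes([jpeg, nil, zip]).map(&:format) # => [:key]
```

With a rule table like this, the "if this is also a ZIP, bail out" checks would move out of the individual parsers and into one place, at the cost of always running every low-level parser first.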

This does clash with the idea of running "at most as many parsers as were requested", but we would get much more intuitive operation in return, and we could remove quite a few hacks.

julik changed the title from Idea: "product" formats/natures to "Product" formats/natures (formats that are both X and Y) on Apr 22, 2018