format_parser is a Ruby library for prying open video, image, document, and audio files. It includes a number of parser modules that try to recover metadata useful for post-processing and layout while reading the absolute minimum amount of data possible.
Currently supported filetypes:
- DOCX, PPTX, XLSX
...with more on the way!
Pass an IO object, such as a file opened in binary mode, to FormatParser.parse and the first confirmed match will be returned.
match = FormatParser.parse(File.open("myimage.jpg", "rb"))
match.nature             #=> :image
match.format             #=> :jpg
match.display_width_px   #=> 320
match.display_height_px  #=> 240
match.orientation        #=> :top_left
You can also use parse_http, passing a URL, or parse_file_at, passing a path:
match = FormatParser.parse_http('https://upload.wikimedia.org/wikipedia/commons/b/b4/Mardin_1350660_1350692_33_images.jpg')
match.nature  #=> :image
match.format  #=> :jpg
If you would rather receive all potential results from the gem, call the gem as follows:
array_of_results = FormatParser.parse(File.open("myimage.jpg", "rb"), results: :all)
You can also optimize the metadata extraction by providing hints to the gem:
FormatParser.parse(File.open("myimage", "rb"), natures: [:video, :image], formats: [:jpg, :png, :mp4], results: :all)
Return values of all parsers have built-in JSON serialization:
img_info = FormatParser.parse(File.open("myimage.jpg", "rb"))
JSON.pretty_generate(img_info) #=> ...
Creating your own parsers
We need to recover metadata from various file types, and we need to do so satisfying the following constraints:
- The data in those files can be malicious and/or incomplete, so we need to be failsafe
- The data will be fetched from a remote location (S3), so we want to obtain it with as few HTTP requests as possible
- ...and with a small amount of data fetched, although the number of HTTP requests is the greater concern
- The data can be recognized ambiguously and match more than one format definition (like TIFF sections of camera RAW)
- The information necessary is a small subset of the overall metadata available in the file.
- The number of supported formats is only ever going to increase, not decrease
- The library is likely to be used in multiple consumer applications
- The library is likely to be used in multithreading environments
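Several of these constraints can be illustrated with a small sketch. It is not format_parser's actual implementation: the `Result` struct, the `TinyPNG` and `TinyGIF` parsers, and the `read_exactly` and `detect` helpers are hypothetical names made up for this example. Each parser reads only the few header bytes it needs and returns nil when the data does not match, and the dispatcher treats any exception raised by a parser as a non-match, so truncated or hostile input cannot abort detection.

```ruby
require 'stringio'

# Hypothetical result type for this sketch (not the gem's own result class).
Result = Struct.new(:nature, :format, :width_px, :height_px)

# Reads exactly n bytes or raises, so truncated input is detected early.
def read_exactly(io, n)
  data = io.read(n)
  raise EOFError, "wanted #{n} bytes" if data.nil? || data.bytesize < n
  data
end

# Hypothetical minimal PNG parser: signature plus the start of IHDR (24 bytes).
module TinyPNG
  SIGNATURE = "\x89PNG\r\n\x1A\n".b

  def self.call(io)
    return nil unless read_exactly(io, 8) == SIGNATURE
    _length, chunk_type = read_exactly(io, 8).unpack('Na4')
    return nil unless chunk_type == 'IHDR'
    width, height = read_exactly(io, 8).unpack('NN') # big-endian 32-bit
    Result.new(:image, :png, width, height)
  end
end

# Hypothetical minimal GIF parser: signature plus logical screen size (10 bytes).
module TinyGIF
  def self.call(io)
    header = read_exactly(io, 10)
    return nil unless header.start_with?('GIF87a', 'GIF89a')
    width, height = header.byteslice(6, 4).unpack('vv') # little-endian 16-bit
    Result.new(:image, :gif, width, height)
  end
end

PARSERS = [TinyPNG, TinyGIF].freeze

# Runs every parser against the same IO. A parser that raises is treated as
# "no match", so one bad input cannot take down the whole detection.
def detect(io, results: :first)
  matches = PARSERS.map do |parser|
    io.seek(0)
    begin
      parser.call(io)
    rescue StandardError
      nil
    end
  end.compact
  results == :all ? matches : matches.first
end

png_header = TinyPNG::SIGNATURE + [13].pack('N') + 'IHDR' + [320, 240].pack('NN')
match = detect(StringIO.new(png_header))
puts match.format    # prints "png"
puts match.width_px  # prints 320
p detect(StringIO.new("\x89PN".b)) # truncated input, prints nil
```

Because every parser is tried against the same input, passing `results: :all` to this hypothetical `detect` returns every match, which is how ambiguous data could be surfaced to the caller instead of being silently resolved.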
Deliberate design choices
Therefore we adopt the following approaches:
- Modular parsers per file format, with some degree of code sharing between them (but not too much). Adding new formats should be low-friction, and testing these format parsers should be possible in isolation
- Modular and configurable IO stack that supports limiting reads/loops from the source entity. The IO stack is isolated from the parsers, meaning parsers do not need to care about things like fetches using Range: headers, GZIP compression and the like
- A caching system that allows us to ideally fetch once, and only once, and as little as possible - but still accommodate formats that have the important information at the end of the file or might need information from the middle of the file
- Minimal dependencies, and if dependencies are to be used they should be very stable and low-level
- Where possible, use small subsets of full-feature format parsers since we only care about a small subset of the data.
- When a choice arises between using a dependency or writing a small parser, write the small parser since less code is easier to verify and test, and we likely don't care about all the metadata anyway
- Avoid using C libraries which are likely to contain buffer overflows/underflows - we stay memory safe
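The IO-related choices above (limited reads, isolation from the parsers, and caching) can be sketched as follows. This is an illustration under stated assumptions, not the gem's own IO stack: `CountingSource`, `PagedReader`, `fetch_range`, `page_size`, and `max_reads` are all hypothetical names. The source stands in for a remote object and counts ranged fetches; the reader fetches fixed-size pages, caches each page so it is fetched at most once, and raises once a read budget is exceeded.

```ruby
# Hypothetical stand-in for a remote object: counts the ranged fetches made.
class CountingSource
  attr_reader :requests

  def initialize(bytes)
    @bytes = bytes
    @requests = 0
  end

  def size
    @bytes.bytesize
  end

  # Comparable to an HTTP GET carrying a Range: header.
  def fetch_range(offset, length)
    @requests += 1
    @bytes.byteslice(offset, length) || ''.b
  end
end

# Hypothetical IO wrapper: fetches fixed-size pages, caches them, and
# refuses to continue once the read budget is exhausted.
class PagedReader
  class BudgetExceeded < StandardError; end

  def initialize(source, page_size: 64, max_reads: 16)
    @source = source
    @page_size = page_size
    @max_reads = max_reads
    @reads = 0
    @pages = {}
    @pos = 0
  end

  def seek(offset)
    @pos = offset
    0
  end

  def read(n)
    @reads += 1
    raise BudgetExceeded, "more than #{@max_reads} reads" if @reads > @max_reads
    return nil if @pos >= @source.size

    first_page = @pos / @page_size
    last_page = (@pos + n - 1) / @page_size
    joined = (first_page..last_page).map { |index| page(index) }.join
    data = joined.byteslice(@pos - first_page * @page_size, n)
    @pos += data.bytesize
    data
  end

  private

  # Each page is fetched from the source at most once.
  def page(index)
    @pages[index] ||= @source.fetch_range(index * @page_size, @page_size)
  end
end

source = CountingSource.new(('a'..'z').to_a.join * 10) # 260 bytes
io = PagedReader.new(source, page_size: 64)
io.read(4)   # fetches page 0
io.read(4)   # served from the cached page 0
io.seek(200)
io.read(10)  # fetches page 3 only
puts source.requests # prints 2
```

A parser written against `read` and `seek` never sees pages, budgets, or fetches, which is the kind of isolation the list above describes. The page size trades request count against bytes transferred: larger pages mean fewer fetches but more data per fetch.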
Unless specified otherwise in this section, the fixture files are MIT licensed and from the FastImage and Dimensions projects.
- divergent_pixel_dimensions_exif.jpg is used with permission from LiveKom GmbH
- extended_reads.jpg has kindly been made available by Raphaelle Pellerin for use exclusively with format_parser
- too_many_APP1_markers_surrogate.jpg was created by the project maintainers
- orient_6.jpg is used with permission from Renaud Chaput
- fixture.aiff was created by one of the project maintainers and is MIT licensed
- c_11k16bitpcm.wav and c_8kmp316.wav are from Wikipedia WAV, retrieved January 7, 2018
- c_39064__alienbomb__atmo-truck.wav is from freesound and is CC0 licensed
- c_M1F1-Alaw-AFsp.wav and d_6_Channel_ID.wav are from a McGill Engineering site
- Cassy.mp3 has been produced by WeTransfer and may be used with the library for the purposes of testing
- fixture.fdx was created by one of the project maintainers and is MIT licensed
- DPX files were created by one of the project maintainers and may be used with the library for the purposes of testing
- bmff.mp4 is borrowed from the bmff project
- Test_Circular MOV files were created by one of the project maintainers and are MIT licensed
- CR2 examples were downloaded from http://www.rawsamples.ch/ and are Creative Commons licensed
- atc_fixture_vbr.flac is a converted version of the MP3 with the same name
- c_11k16bitpcm.flac is a converted version of the WAV with the same name
- with_garbage_at_the_end.ogg has been generated by the project contributors
- fixture.m4a was created by one of the project maintainers and is MIT licensed
- simulator_screenie.png was provided by Rens Verhoeven
- Shinbutsureijoushuincho.tiff was obtained from Wikimedia Commons and is Creative Commons licensed
- IMG_9266_*.tif and all its variations were created by the project maintainers
- The .zip fixture files have been created by the project maintainers
- The .docx files were generated by the project maintainers
JPEG examples of EXIF orientation
- Downloaded from Unsplash (and thus freely available, see https://unsplash.com/license) and then manipulated using the https://github.com/recurser/exif-orientation-examples script
- keynote_recognized_as_jpeg.key file was created by the project maintainers
Copyright (c) 2019 WeTransfer.
format_parser is distributed under the conditions of the Hippocratic License. See LICENSE.txt for further details.