Input from directories and zip files

Hi, I noticed that the `brf2ebrl` command can take input from multiple Braille files and create a single eBraille file from the result.  For example, if you have one Braille file for each chapter, you can create a single eBraille file for the entire book containing all the chapters.

I was helping an organisation that keeps their Braille books in `zip` files of one `brl` per chapter (I'll come back to `brl` vs `brf` below).  It worked when I unpacked a `zip` to a temporary directory and passed `*.brl` to `brf2ebrl` but I wondered if it would be a good idea to automate this part of the process, adding code to `brf2ebrl` itself to be able to scan directories and `zip` files.  I submitted this as pull request aphtech/brf2ebrl#2 (currently closed pending discussion on this issue first).

Points copied from aphtech/brf2ebrl#2 (with my comments added, please discuss / critique as necessary):

> Does each BRF create a separate eBraille bundle or do all inputs get bundled into a single eBraille file?

I was imagining "all inputs get bundled into a single eBraille file" to mirror the current behaviour of what happens if you put `*.brf` on the command line.  In other words, I was imagining reading a directory (or a zip file) is just an extension of our existing code to expand a wildcard "glob".

> What order do files get detected when scanning directories?

I was imagining "the same order as the current behaviour of expanding a `*.brf` 'glob'" i.e. case-insensitive sort (in the pull request I copied the existing code from glob expansion into processing the directory scan).
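
To make the ordering concrete, here is a minimal sketch of a directory scan that sorts case-insensitively.  `scan_directory` is a hypothetical name for illustration only; it is not the code from the pull request or from the library's glob expansion.

```python
from pathlib import Path


def scan_directory(directory):
    """Return the regular files directly inside `directory`,
    sorted case-insensitively by file name.

    Hypothetical helper for illustration; not part of brf2ebrl.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # casefold() means "Ch1.brl" and "ch1.brl" sort together,
    # instead of all upper-case names sorting before lower-case ones.
    return sorted(files, key=lambda p: p.name.casefold())
```

One caveat with any plain string sort like this: `chapter10.brl` comes before `chapter2.brl`, so chapter numbers need to be zero-padded for the files to arrive in reading order.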

> Do we want to use BRL files, these are not formatted Braille

The BRL files I am looking at **are** formatted Braille.  They are formatted to a fixed number of columns, they have page boundaries, they have running headings, they have page numbers.  In other words, they have everything that `brf2ebrl` is designed to detect.

I was somehow under the impression that BRL is formatted Braille and BRF is unformatted Braille.  Have I got this the wrong way around?  I didn't just invent this myself: I got it from somewhere (although right now I'm not sure exactly where I first heard it), so it seems **not everybody in the world is clear about whether BRL is formatted or not and whether BRF is formatted or not**, and this lack of clarity has led to at least one institution having large numbers of files with the extension BRL that do in fact contain formatted Braille.

We might have to take this legacy situation as "OK, so it doesn't matter what the file extension is, just as long as it contains formatted Braille".  One advantage of the new eBraille standard should be that it's a well-defined standard and we'll no longer have to worry about this situation of people and organisations having different ideas about which extension does what.

> How is file type determined, extension, mimetype, etc?
> How should unknown file types be handled or should they be ignored.

In the pull request I submitted, the code simply assumes that **every** file it encounters is a formatted Braille file.  We might want to extend this to detect and warn about files that cannot be processed.
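
As a rough idea of what "detect and warn" could look like, here is a sketch that checks file contents rather than the extension.  It rests on an assumption of mine: that a formatted Braille file contains only printable ASCII plus line endings and form feeds (page breaks), with an optional Ctrl-Z at the end.  The names `looks_like_braille` and `filter_braille` are hypothetical, and the check is deliberately crude (any plain ASCII text file would pass).

```python
import sys

# Assumption: ASCII Braille uses bytes 0x20-0x7E, plus LF, CR,
# form feed (page break) and, in some older files, Ctrl-Z (0x1A).
_ALLOWED = frozenset(range(0x20, 0x7F)) | {0x0A, 0x0D, 0x0C, 0x1A}


def looks_like_braille(data: bytes) -> bool:
    """Crude content check: non-empty and made only of allowed bytes."""
    return bool(data) and all(byte in _ALLOWED for byte in data)


def filter_braille(paths):
    """Yield the paths that pass the check; warn on stderr about the rest."""
    for path in paths:
        with open(path, "rb") as handle:
            data = handle.read()
        if looks_like_braille(data):
            yield path
        else:
            print(f"warning: skipping {path}: does not look like ASCII Braille",
                  file=sys.stderr)
```

This would let a stray image or `.docx` inside a zip be skipped with a warning instead of being fed to the converter.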

> We feel these questions may be answered differently by different people and so it probably is best left for the frontend apps (eg. https://github.com/aphtech/Convert2EBRL) to control this sort of stuff instead of putting it in the library and removing the control from frontends.

I'm not convinced that adding more functionality to the library removes control from frontends, because the frontends don't _**have**_ to use the new functionality if they don't want to.  But there might still be a related concern about the library becoming more complicated to maintain than it needs to be (this is only a small addition, but there may be a cumulative effect if there are many small additions).

>  If this functionality were to be added to the brf2ebrl CLI script, this then should be constrained to only change files in the brf2ebrl.scripts package, thus assuring us of no breakage for frontend apps depending on this library.

OK, if you don't mind that the resulting code change would have more lines and look less elegant.  The pull request I submitted uses a Python generator to yield the extracted files in such a way that Python automatically calls the code to delete these files immediately after they have been read by the library.  I could do it by changing only the command-line script instead, but the resulting code would probably have to keep the temporary directories around, with explicit clean-up and exception handling: I'm unlikely to be able to get it into a few lines as it is currently.
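
For anyone who hasn't looked at the pull request, the generator idea is roughly as follows.  This is a simplified sketch, not the code I submitted: `files_from_zip` is a hypothetical name, and here the whole temporary directory is removed once the generator finishes, rather than file by file.

```python
import tempfile
import zipfile
from pathlib import Path


def files_from_zip(zip_path):
    """Yield paths of files extracted from a zip, sorted case-insensitively.

    The temporary directory is deleted automatically when the generator
    is exhausted or closed, so no explicit clean-up code is needed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        files = [p for p in Path(tmp).rglob("*") if p.is_file()]
        for path in sorted(files, key=lambda p: str(p).casefold()):
            # The file only exists while the generator is still running,
            # so the consumer must read it before asking for the next one.
            yield path
```

The catch is that each yielded path is only valid while the caller is still iterating, which is why the consumer has to read files as they arrive rather than collecting all the names first and opening them later.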

> Finally, other than the ZIP file example, it is unclear whether this adds anything which cannot already be done with the brf2ebrl script. Are you aware that brf2ebrl allows multiple files to be specified on the command line and these get bundled into a single eBraille file? You then can rely upon wildcards and shell expansion to gather files from multiple directories?

Yes, I did it only for reading zip files.  Reading directories was simply a necessary addition in service of this, because reading zip files means you have to unpack the zip into a temporary directory (unless we change the entire library to be able to work from in-memory representations of the files instead of their filenames on disk, but that would be a bigger change).
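
For comparison, the in-memory alternative would look something like the sketch below, which reads each zip member straight into bytes with no temporary directory.  `read_zip_members` is a hypothetical name, and nothing here implies the library can accept in-memory data today; that is exactly the bigger change I mentioned.

```python
import zipfile


def read_zip_members(zip_path):
    """Return (name, bytes) pairs for every file in the archive,
    sorted case-insensitively by member name.  Nothing is written to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = [info.filename for info in archive.infolist()
                 if not info.is_dir()]
        return [(name, archive.read(name))
                for name in sorted(names, key=str.casefold)]
```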

I just thought adding the ability to ingest a zip file into `brf2ebrl` itself would save having to do this in a wrapper script.
