HTML archive format and implementation #5

danny0838 · 2017-06-22T16:29:28Z

In version 0.3.0 we use the .htz extension for the zipped package of a captured web page. We also implemented a viewer that can load a .htz file directly in the browser, via either the direct method (open the .htz file in Chrome; the user must check "allow file url access") or the indirect method (open the viewer from the toolbar dropdown list and then pick a .htz file). Unfortunately, the viewer requires the requestFileSystem API, which is currently not available in Firefox.

Is there another way to implement the .htz viewer using the WebExtension APIs currently available in Firefox? One idea is the technique that EPUBReader uses, but its source code seems to be obfuscated and is not available for us to study.

We currently consider the best cross-platform HTML archive format to be zip-based. Besides the current .htz, another zip-based approach is MAFF, used by the MAF addon, which unfortunately seems to be unmaintained. Besides zips, there are many other types of HTML archive, such as .mhtml, .warc, and .webarchive. Is there another recommended format?
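To illustrate the zip-based idea, here is a minimal Python sketch, not WebScrapBook's actual code. It packs a captured page and its resources into a zip and reads the entry page back. It assumes the entry page is named `index.html`; the names `build_htz`, `read_entry`, and `list_resources` are hypothetical.

```python
import io
import zipfile

# Assumption: the archive's entry page is stored as "index.html" at the zip root.
ENTRY_NAME = "index.html"


def build_htz(files: dict[str, bytes]) -> bytes:
    """Pack a mapping of archive path -> content into zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_entry(htz_bytes: bytes, entry: str = ENTRY_NAME) -> str:
    """Return the entry page of the archive as text (UTF-8 assumed)."""
    with zipfile.ZipFile(io.BytesIO(htz_bytes)) as zf:
        if entry not in zf.namelist():
            raise KeyError(f"archive has no {entry!r} entry")
        return zf.read(entry).decode("utf-8")


def list_resources(htz_bytes: bytes, entry: str = ENTRY_NAME) -> list[str]:
    """List every archived file other than the entry page, sorted by name."""
    with zipfile.ZipFile(io.BytesIO(htz_bytes)) as zf:
        return sorted(name for name in zf.namelist() if name != entry)
```

A viewer along these lines would extract the entry page plus its resources to some temporary storage and then navigate to the entry page. The storage step is the part that depends on requestFileSystem in our current implementation.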

yfdyh000 · 2017-06-23T06:20:01Z

FYI: https://developer.mozilla.org/en-US/docs/Web/API/FileSystemDirectoryReader

danny0838 · 2017-07-12T15:52:34Z

We have now moved the viewer functionality to another addon, Web Archive Viewer. Support for Firefox is implemented.

danny0838 · 2017-07-25T18:10:37Z

We decided to merge the viewer functionality back. (We split it out earlier because it had seemed impossible to implement some functionality in Firefox for Android, but those issues have since been solved and WebScrapBook now basically works on Firefox for Android.)

vensko · 2017-11-15T10:21:40Z

Calibre (ebook-viewer.exe in particular) supports zipped HTML files and uses the .htmlz extension. Renamed .htz files work fine, with minor issues. Could you allow using Calibre's extension, at least optionally?

danny0838 · 2017-11-15T14:13:25Z

Thank you for the information. At a quick glance, it seems that .htmlz is an archive for an ebook, and its purpose could be different from that of .htz. We need to investigate further at some point.

By the way, could you be more specific about the "minor issues" you met? This could help us identify them.

vensko · 2017-11-15T14:19:37Z

CSS mostly. For instance, https://habrahabr.ru/post/342344/ is missing third-party fonts.
Original rendering in Firefox 57: https://imgur.com/XezAxWa
Calibre viewer: https://imgur.com/JNeEzWS

danny0838 · 2021-05-26T07:54:16Z

This is a very old issue. As no promising archive format has been found, we have decided to stick with HTZ, MAFF, and single HTML for now. Conversion between these formats and other formats like MHT, .archive, .warc, etc., may be implemented in PyWebScrapBook in the future, though.

danny0838 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Jun 22, 2017

danny0838 added this to Other formats in Archive Formats Nov 28, 2017

danny0838 closed this as completed May 26, 2021
