HTML archive format and implementation #5

Closed
danny0838 opened this issue Jun 22, 2017 · 7 comments
Labels
enhancement, help wanted

Comments

@danny0838
Owner

danny0838 commented Jun 22, 2017

In version 0.3.0 we use the .htz extension for the zipped package of a captured web page. We also implemented a viewer that can load an .htz file directly in the browser, using either the direct method (open the .htz file in Chrome; the user must check "allow file url access") or the indirect method (open the viewer from the toolbar dropdown list and then pick an .htz file). Unfortunately, the viewer requires the requestFileSystem API, which is currently not available in Firefox.
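
For reference, the "zipped package of a captured web page" can be sketched with Python's standard zipfile module. This is only an illustration and not WebScrapBook's own code: it assumes the captured page sits in a folder whose entry page is named index.html, and the function names pack_htz and read_entry_page are hypothetical.

```python
import zipfile
from pathlib import Path


def pack_htz(page_dir: str, htz_path: str) -> None:
    """Zip every file under page_dir into htz_path, keeping relative paths."""
    root = Path(page_dir)
    with zipfile.ZipFile(htz_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the page folder, with "/" separators.
                archive.write(path, path.relative_to(root).as_posix())


def read_entry_page(htz_path: str, entry: str = "index.html") -> str:
    """Return the text of the entry page (assumed name: index.html)."""
    with zipfile.ZipFile(htz_path) as archive:
        return archive.read(entry).decode("utf-8")
```

The output file should be written outside page_dir; otherwise the archive being created would be picked up by the directory walk.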

Is there another way to implement the .htz viewer using the WebExtension APIs currently available in Firefox? One idea is the technique that EPUBReader uses, but its source code seems to be obfuscated, so we cannot study it.

We currently consider a zip-based format to be the best cross-platform HTML archive format. Besides the current .htz, another zip-based approach is MAFF, used by the MAF add-on, which unfortunately seems to be unmaintained. Beyond zip-based formats, there are many other types of HTML archive, such as .mhtml, .warc, and .webarchive. Is there any other recommended format?

danny0838 added the enhancement and help wanted labels on Jun 22, 2017
@yfdyh000

@danny0838
Owner Author

We have now moved the viewer functionality to a separate add-on, Web Archive Viewer. Firefox support is implemented.

@danny0838
Owner Author

danny0838 commented Jul 25, 2017

We decided to merge the viewer functionality back. (We split it out earlier because some functionality had seemed impossible to implement in Firefox for Android, but those issues have since been solved and WebScrapBook now basically works on Firefox for Android.)

@vensko

vensko commented Nov 15, 2017

Calibre (ebook-viewer.exe in particular) supports zipped HTML files and uses the .htmlz extension. Renamed .htz files work fine, with minor issues. Could you allow using Calibre's extension, at least optionally?
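
The renaming mentioned above can be sketched in Python as a plain copy under a new extension. This is a hypothetical helper (copy_as_htmlz is not part of WebScrapBook or Calibre); it only changes the file name and does nothing to adapt the contents to what Calibre expects.

```python
import shutil
from pathlib import Path


def copy_as_htmlz(htz_path: str) -> Path:
    """Copy an .htz file next to itself with the .htmlz extension."""
    source = Path(htz_path)
    target = source.with_suffix(".htmlz")
    # The bytes are copied unchanged; only the file name differs.
    shutil.copyfile(source, target)
    return target
```

Copying rather than renaming in place keeps the original .htz available for viewers that look for that extension.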

@danny0838
Owner Author

Thank you for the information. At a quick glance, .htmlz seems to be an archive format for ebooks, and its purpose could be different from that of .htz. We need to investigate further at some point.

By the way, could you be more specific about the "minor issues" you encountered? This could help us identify them.

@vensko

vensko commented Nov 15, 2017

Mostly CSS. For instance, https://habrahabr.ru/post/342344/ is missing third-party fonts.
Original rendering in Firefox 57: https://imgur.com/XezAxWa
Calibre viewer: https://imgur.com/JNeEzWS

danny0838 added this to Other formats in Archive Formats on Nov 28, 2017
@danny0838
Owner Author

danny0838 commented May 26, 2021

This is a very old issue. As no promising archive format has been found, we have decided to stick with HTZ, MAFF, and single HTML for now. Conversion between these formats and others such as MHT, .archive, and .warc may be implemented in PyWebScrapBook in the future, though.

3 participants