The option that is supposed to group duplicated pictures doesn't seem to work #578

Closed
abolibibelot1980 opened this issue Dec 26, 2020 · 5 comments


@abolibibelot1980 commented Dec 26, 2020

Describe the bug
When saving a page that contains several instances of the same picture, each instance is saved as a separate base64 stream, which can produce huge files. This happens even though the option meant to prevent it, by replacing redundant copies with references to the first instance, is enabled.

To Reproduce
For instance, this page:
https://www2.yggtorrent.si/torrent/filmvid%C3%A9o/animation-s%C3%A9rie/553077-dragon+ball+z+int%C3%A9grale+-+broadcast+audio-multi+dvdrip+x264-mirolo
...was saved as a whopping 108MB HTML file, because each avatar picture was saved separately, and one particular avatar from a member who posted many messages on that page (nickname “andiandi”) takes 1124220 bytes (in base64).
This happened even though, as I just verified, the option “regrouper les images dupliquées” (group duplicated pictures) was checked. With WinHex I can verify that the file contains many strictly identical 1124220-byte blocks.
As a side note, saved pages don't seem to retain the URLs of saved images, which would definitely be useful. For instance, I can't find the URL of the aforementioned avatar picture without going back to the online page (apparently the format is PNG, which may explain the large size).
The source code looks like this:

<div class=left>
 <a href=https://www2.yggtorrent.si/profile/181057-andiandi><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEU...
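The duplication described above can also be checked without a hex editor. What follows is only a minimal sketch, not part of SingleFile: the function name `duplicate_data_uris` and the regular expression are assumptions made for illustration. It counts base64 data URIs that occur more than once in the text of a saved page.

```python
import hashlib
import re
from collections import Counter

# Matches base64 data URIs such as data:image/png;base64,iVBORw0...
DATA_URI_RE = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def duplicate_data_uris(html: str) -> list[tuple[int, int, str]]:
    """Return (count, length, short_hash) for each data URI found more than once."""
    counts = Counter(DATA_URI_RE.findall(html))
    report = [
        (n, len(uri), hashlib.sha1(uri.encode("ascii")).hexdigest()[:8])
        for uri, n in counts.items()
        if n > 1
    ]
    # Sort by redundant size: (count - 1) * length characters could be saved.
    return sorted(report, key=lambda r: (r[0] - 1) * r[1], reverse=True)


# Tiny in-memory stand-in for a saved page (hypothetical content).
sample = (
    '<img src="data:image/png;base64,AAAA">'
    '<img src="data:image/png;base64,AAAA">'
    '<img src="data:image/png;base64,BBBB">'
)
for count, length, digest in duplicate_data_uris(sample):
    print(f"{count} copies of a {length}-character data URI (sha1 {digest})")
```

To use it on a real file, read the file's text and pass it in place of `sample`.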

Expected behavior
In this case, as stated in the description for that option, only one instance of each picture in a page should actually be saved as a base64 stream, and any other copy should be saved as a mere reference to the first instance.

Environment

  • Windows 7
  • Firefox 84.0

@gildas-lormeau (Owner) commented Dec 27, 2020

This option is supposed to detect images that are used multiple times, transform them into background images, and use CSS custom properties to reference them.

If I inspect an avatar image that appears more than once in the page, I see that SingleFile worked as expected. You can verify this by right-clicking an avatar that is used more than once and selecting "Inspect Element" in the context menu. This opens the developer tools and selects the corresponding <IMG> tag (cf. the line highlighted in blue).

[screenshot]

If you look at the src attribute of this tag, you'll see that the image is actually a transparent SVG image and the background-image CSS property is set to something similar to var(--sf-img-49).

[screenshot]

Then, if you scroll down a little in this panel, you'll find the value of the custom property --sf-img-49. It's a data URI storing the image.

[screenshot]

You can also verify that the custom property is used multiple times in the Elements view by searching for it (with Ctrl-F). For example, there are 29 occurrences of the value --sf-img-49.

[screenshot]

Note that this algorithm is not applied to large images or to images that already have background properties defined. Do you have more details about the data that is duplicated in the HTML page?
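The grouping described above can be sketched roughly as follows. This is not SingleFile's actual code, only an illustration under stated assumptions: the function name `group_duplicate_images`, the exact placeholder SVG, and the `max_len` limit are hypothetical. The thread mentions limits on width and height, so the length of the data URI is used here merely as a stand-in. Images that already carry a `style` attribute are skipped, as a crude approximation of "already have background properties".

```python
import re
from collections import Counter

IMG_TAG_RE = re.compile(r"<img\b[^>]*>")
SRC_RE = re.compile(r'\bsrc="(data:[^"]+)"')
# Hypothetical placeholder: the thread only says "a transparent SVG image".
PLACEHOLDER = "data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg'/%3E"


def group_duplicate_images(html: str, max_len: int = 100_000) -> str:
    """Move repeated embedded images into CSS custom properties (simplified)."""

    def eligible_src(tag):
        m = SRC_RE.search(tag)
        # Skip non-embedded images, styled images, and oversized data URIs.
        if m is None or "style=" in tag or len(m.group(1)) > max_len:
            return None
        return m.group(1)

    counts = Counter(
        src for tag in IMG_TAG_RE.findall(html) if (src := eligible_src(tag))
    )
    repeated = [uri for uri, n in counts.items() if n > 1]
    names = {uri: f"--sf-img-{i}" for i, uri in enumerate(repeated)}

    def rewrite(match):
        tag = match.group(0)
        src = eligible_src(tag)
        if src not in names:
            return tag
        new_attrs = f'src="{PLACEHOLDER}" style="background-image:var({names[src]})"'
        return SRC_RE.sub(lambda _: new_attrs, tag, count=1)

    body = IMG_TAG_RE.sub(rewrite, html)
    if not names:
        return body
    # Each repeated image is stored once, as the value of a custom property.
    rules = "".join(f'{name}:url("{uri}");' for uri, name in names.items())
    return f"<style>:root{{{rules}}}</style>{body}"


avatar = "data:image/png;base64,AAAA"
page = f'<img src="{avatar}"><img src="{avatar}">'
print(group_duplicate_images(page))
```

In the output, the avatar's data URI appears once, in the `:root` rule, and both `<img>` tags reference it through `var(--sf-img-0)`. With a small `max_len`, the same input is returned unchanged, which mirrors the behavior reported in this issue for the very large avatar.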

Regarding the side note (and maybe the current issue too), I would recommend taking a look at SingleFileZ. It stores the URLs of resources as comments in the zip file and in the index.json file, which can be found in the root folder of the zip. FYI, the file produced by SingleFileZ weighs 12MB.

@gildas-lormeau (Owner) commented Dec 27, 2020

I think the issue is related to the large images (e.g. the avatar of "Andiandi"). I am forced to define a max width and a max height because Firefox does not support values that are too large for custom properties. As far as I know, this is not really documented so I've chosen safe values. That could probably be optimized.

@gildas-lormeau (Owner) commented Jan 9, 2021

I'm closing the issue since the option works as expected.

@abolibibelot1980 (Author) commented Jan 12, 2021

I don't know if I can still add a new comment although the issue has been marked as “closed”.
Sorry for replying a bit late. Time flies, and flies are hard to catch.

Thanks for the detailed explanations. I don't fully understand them, as my knowledge of HTML is very cursory, but I gather that this behavior is a compromise made necessary by a limitation in Firefox itself. So, as I understand it, if such large images were replaced by references to the first instance on a given page, the resulting files would not be displayed properly in Firefox. Is that what you mean? Aren't there other methods that could be used to reduce file sizes in such cases? One possibility would be an option to save images resized to their display size (which would be very small in the case of an avatar) instead of the actual source images, only for images that are displayed at a size smaller than their actual size.

FWIW, I had a similar issue some months ago with this page, which was saved as a 263MB HTML file:
https://podbay.fm/podcast/966297954
(Likewise, because of a large image displayed multiple times — in this case a 3000x3000px JPG image.)

Is there any way I could edit those files to remove the redundant data without breaking the code's compliance? What happens if I simply wipe (using WinHex) the base64 data for all instances of the problematic image: would it be displayed as a blank image, or would the file no longer load properly in Firefox or any compatible utility? (By the way, is there any standalone utility that can at least view files in “enhanced HTML”, and possibly edit them as well? With the MHT format, formerly compatible with Firefox, I've used BlockNote, which worked quite well, but it doesn't properly display files created by SingleFile, although I still use it, for lack of a better option, to view (e)HTML files outside of Firefox.)
I could, more simply, compress such large files with WinRAR or 7-Zip: that 108MB file gets compressed to 9.99MB with WinRAR (RAR5, “Good”, 128MB) and 9.80MB with 7-Zip (LZMA2, “Max”, 64MB, 64, 8T).
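A plausible reason the archive shrinks so much is that a dictionary-based compressor can encode each repeated copy as a back-reference to the first one, provided the copies fit within its dictionary. The sketch below is only an illustration of that effect on synthetic data: it uses Python's standard `lzma` module with default settings rather than WinRAR or 7-Zip, and the image size and copy count are arbitrary assumptions.

```python
import base64
import lzma
import random

rng = random.Random(0)
# Stand-in for one large embedded image: 100 kB of pseudo-random
# (incompressible) bytes, base64-encoded as it would be in a data URI.
image_b64 = base64.b64encode(bytes(rng.getrandbits(8) for _ in range(100_000)))


def compressed_size(copies: int) -> int:
    """LZMA-compressed size of a page embedding the same image `copies` times."""
    tag = b'<img src="data:image/png;base64,' + image_b64 + b'">'
    return len(lzma.compress(tag * copies))


one = compressed_size(1)
ten = compressed_size(10)
print(f"1 copy: {one} bytes compressed; 10 copies: {ten} bytes compressed")
```

Ten copies should compress to only slightly more than one copy, since the nine extra copies are almost entirely redundant.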

Regarding the “side note” part: I looked into SingleFileZ when I discovered SingleFile, but although there are advantages to saving pages in a compressed format, I prefer to save them in a plain text format, which makes it possible, for instance, to search for keywords inside files (I use Total Commander for such purposes; it can search inside common archive files, but it's much slower). That is why I opted for SingleFile.
As for displaying the origin URL of images: if it's not possible to store both an image's binary content and its URL, perhaps it would be possible to add (optionally) a list of embedded media files with their origin URLs in the file's header.
But it actually seems to display them in some cases, and I don't quite get how it works: for instance, I saved this very page, and in the resulting file I can find the origin URLs of the screenshots you posted above (for instance: <a target=_blank rel="noopener noreferrer" href=https://user-images.githubusercontent.com/396787/103161419-e16a8400-47e1-11eb-9417-f992666213db.png><img src="data:image/png;base64,iVBORw...). How is it different from the aforementioned page?
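In the snippet quoted above, the URL sits in the `href` of the `<a>` tag that wraps the image, not in the `<img>` tag itself. In the snippet from the first page, the wrapping link points to a member profile instead. A minimal sketch for listing such wrapping links in a saved page, using Python's standard `html.parser`, could look like this. The class name `LinkedImageFinder` and the example.org URL are hypothetical.

```python
from html.parser import HTMLParser


class LinkedImageFinder(HTMLParser):
    """Collect the href of every <a> that encloses an embedded (data URI) image."""

    def __init__(self):
        super().__init__()
        self._href_stack = []  # hrefs of the <a> tags currently open
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href_stack.append(attrs.get("href"))
        elif tag == "img" and (attrs.get("src") or "").startswith("data:"):
            if self._href_stack and self._href_stack[-1]:
                self.found.append(self._href_stack[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


finder = LinkedImageFinder()
finder.feed(
    "<a target=_blank href=https://example.org/pic.png>"
    '<img src="data:image/png;base64,AAAA"></a>'
    '<img src="data:image/png;base64,BBBB">'
)
print(finder.found)
```

Only the first image is reported, because the second one is not wrapped in a link. The reported URL is whatever the page linked to, so it is the image's own address only when the page linked the image to its full-size version.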

Thanks again.

@gildas-lormeau (Owner) commented Jan 12, 2021

Files produced by SingleFileZ can be indexed, there is an option for that.

Edit: this is a quick answer. I'll answer the other points later.
