Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large file #35

Closed
thery opened this issue Apr 17, 2021 · 12 comments
Closed

Large file #35

thery opened this issue Apr 17, 2021 · 12 comments

Comments

@thery
Copy link

thery commented Apr 17, 2021

Alectryon seems to generate big files. I realized that when github.io
told me I could not transfer file > 100MB. This file is 228M while its equivalent in Proviola was only 18M
Is it possible to do something?

@cpitclaudel
Copy link
Owner

Yes, there are a few things that we could do, some easier than others:

  • We can record only the first goal, not all the goals, at each step. I think proviola only shows the first goal, right? This makes a large difference when the are many goals and only the first one is being operated on. It's not ideal for code that uses "all:", though.
  • We can store the goals in a separate datastructure and render them dynamically. This is a bit more work, but it would save a lot of space. We would need to add a bit of JS to render serialized goals to HTML, and change alectryon to put the goals in a serialized JS structure instead of prerendered inline. This would save space because there would be less structure and cruft.

To get a best-case estimate, could you try gzip-ing the two files and seeing how large each of them are?

@thery
Copy link
Author

thery commented Apr 18, 2021

You mean the proviola (18M/320K) while the alectryon (228M/7M)?

@cpitclaudel
Copy link
Owner

Yes, absolutely. Thanks! One more experiment: if you enable caching, how large is the resulting cache file, and how much does it compress? (you can enable caching with --cache-directory)

@thery
Copy link
Author

thery commented Apr 18, 2021

Not sure how I do this with firefox.
I did a clear cache before. It says it is 400K. after loading the file it says 1,4M

@cpitclaudel
Copy link
Owner

Oh, sorry: I meant enabling caching in Alectryon: Alectryon can speed up repetitive compilations by saving goals and responses as JSON files.

@thery
Copy link
Author

thery commented Apr 22, 2021

Did the following command

~/git/alectryon/alectryon.py --frontend coqdoc shanoi4.v --cache-directory ./cache/

but the html file is about the same size and there is nothing in the cache directory

@cpitclaudel
Copy link
Owner

cpitclaudel commented Jul 9, 2021

You're right, there is was no support for caching for no-reStructuredText documents. I've added that just now. Can you try again and let me know?

Here's what I see on a mid-sized file:

$ mkdir tmp; cd tmp
$ cp $(opam config var coq:lib)/theories/Lists/List.v ./List.v
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . List.v
$ gzip -9 --keep List.html List.v.cache
$ ls -lh List*
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 109K Jul  8 23:09 List.html.gz
-rw-r--r-- 1 clement clement  92K Jul  8 23:09 List.v
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement  96K Jul  8 23:09 List.v.cache.gz

So on an example like List.v proper movie files ("caches") are not much smaller than HTML files (3.2MB vs 4.0MB, and the difference is smaller once compressed at 96K vs 109K).

Same conclusions, though with an even more pronounced reduction in size, for a file known to give Alectryon trouble:

$ cp $(opam config var coq:lib)/theories/Reals/Ranalysis3.v ./
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . Ranalysis3.v 
$ gzip -9 --keep Ranalysis3.html Ranalysis3.v.cache 
$ ls -lh Ranalysis3*
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 359K Jul  8 23:19 Ranalysis3.html.gz
-rw-r--r-- 1 clement clement  29K Jul  8 23:17 Ranalysis3.v
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 278K Jul  8 23:18 Ranalysis3.v.cache.gz

Oh, and if you're curious, here's a comparison of brotli vs gzip:

-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement  74K Jul  8 23:09 List.html.br
-rw-rw-r-- 1 clement clement 109K Jul  8 23:09 List.html.gz
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement  59K Jul  8 23:09 List.v.cache.br
-rw-rw-r-- 1 clement clement  96K Jul  8 23:09 List.v.cache.gz
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  48K Jul  8 23:19 Ranalysis3.html.br
-rw-rw-r-- 1 clement clement 359K Jul  8 23:19 Ranalysis3.html.gz
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement  17K Jul  8 23:18 Ranalysis3.v.cache.br
-rw-rw-r-- 1 clement clement 278K Jul  8 23:18 Ranalysis3.v.cache.gz

So it doesn't look like caches as currently implemented help much versus plain HTML in terms of size. Also, it looks like if your server supports compression you're not in too bad a place (especially if it supports brotli, which beatz gzip by 7x in the Ranalysis3 case).

If you need smaller files, e.g. to keep them in a git repo, then caches don't immediately help, because they're not designed for size (they're pretty-printed, among other things). Let's see if we can make them smaller. Here's what happens when you just remove the blanks:

$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < Ranalysis3.v.cache > Ranalysis3.v.cache.min 
$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < List.v.cache > List.v.cache.min 
$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 1.5M Jul  8 23:25 List.v.cache.min
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 9.5M Jul  8 23:25 Ranalysis3.v.cache.min

That's a bit more than 50% off; minifying further, using a less verbose format, we can save about 35% less:

$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 888K Jul  8 23:32 List.v.cache.min2
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 6.3M Jul  8 23:32 Ranalysis3.v.cache.min2

and finally with a simple sed script to try to gauge further savings I get an extra 25%:

$ ls -lh *.cache* *.html 
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 522K Jul  8 23:43 List.v.cache.xs
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 4.8M Jul  8 23:44 Ranalysis3.v.cache.xs

So, it would be reasonable, if uncompressed size is a concern, to invest in making smaller movie files. I think we could save about 80% versus uncompressed HTML. We would generate webpages that have only placeholders for the <pre> blocks, and we'd populate them dynamically from a downloaded json file. I could write a simple minifier, but I won't have time to write the webpage loader soon (it should be relatively easy, though.). With additional cleverness we might save space currently lost on repeated strings, like when we have the same subgoal repeated ten times, but I'm not sure if we want to spend the effort.

@cpitclaudel
Copy link
Owner

I've done many more experiments with this. Adding a simple form of deduplication (replacing repeated goals and hypotheses with backreferences) shrinks caches quite a bit:

3.2M List.v.cache
443K List.v.cache.dedup

 21M Ranalysis3.v.cache
352K Ranalysis3.v.cache.dedup    

I've implemented something similar for webpages directly, and it works pretty well, too:

4.4M List.html
1.8M List.min.html

 25M Ranalysis3.html
1.5M Ranalysis3.min.html

(It also makes pages load a lot faster, about 60% in the Ranalysis case).

Finally, I've added an option for cache compression. One thing that's really striking is that XZ and Brotli beat GZip by far on these examples:

 21M Jul  8 23:18 Ranalysis3.v.cache
278K Jul  8 23:18 Ranalysis3.v.cache.gz
 19K Jul  8 23:18 Ranalysis3.v.cache.br

Can you try the implementation that's in the wip branch? It should be enough to pass --cache-directory ., --html-minification and --cache-compression=xz.

cpitclaudel added a commit that referenced this issue Jul 12, 2021
Part of GH-35.  This saves quite a bit of space.
cpitclaudel added a commit that referenced this issue Jul 12, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 12, 2021
@thery
Copy link
Author

thery commented Jul 12, 2021

sorry I've been busy lately. I will try to test asap (firtunately by the end of the week)

cpitclaudel added a commit that referenced this issue Jul 13, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 13, 2021
cpitclaudel added a commit that referenced this issue Jul 13, 2021
Part of GH-35.  This saves quite a bit of space.
cpitclaudel added a commit that referenced this issue Jul 13, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 13, 2021
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Part of GH-35.  This saves quite a bit of space.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Part of GH-35.  This saves quite a bit of space.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Part of GH-35.  This saves quite a bit of space.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
Some measurements:

    4.4M List.html
    115K List.html.gz
     83K List.html.br

    2.0M List.v.deduplicated-hyp.html
     88K List.v.deduplicated-hyp.html.gz
     78K List.v.deduplicated-hyp.html.br

    1.9M List.v.deduplicated-hyp-hyps.html
     88K List.v.deduplicated-hyp-hyps.html.gz
     77K List.v.deduplicated-hyp-hyps.html.br

    1.8M List.v.deduplicated-hyp-hyps-ccl.html
     88K List.v.deduplicated-hyp-hyps-ccl.html.gz
     78K List.v.deduplicated-hyp-hyps-ccl.html.br

     25M Ranalysis3.html
    359K Ranalysis3.html.gz
     53K Ranalysis3.html.br

    3.0M Ranalysis3.v.deduplicated-hyp.html
     68K Ranalysis3.v.deduplicated-hyp.html.gz
     39K Ranalysis3.v.deduplicated-hyp.html.br

    2.3M Ranalysis3.v.deduplicated-hyp-hyps.html
     63K Ranalysis3.v.deduplicated-hyp-hyps.html.gz
     38K Ranalysis3.v.deduplicated-hyp-hyps.html.br

    1.5M Ranalysis3.v.deduplicated-hyp-hyps-ccl.html
     58K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.gz
     35K Ranalysis3.v.deduplicated-hyp-hyps-ccl.html.br

Part of GH-35.
cpitclaudel added a commit that referenced this issue Jul 15, 2021
@cpitclaudel
Copy link
Owner

OK, I've merged the changes; you can use --html-minification. Please try it and we'll refine the feature directly on master.

@thery
Copy link
Author

thery commented Jul 26, 2021

very good! With --html-minification my files have really shrunk. The largest one is now only 3M.
Here is the percent of the size of the old versions with respect to the size of the new ones.

triangular.html	133.00 %
phi.html	147.00 %
extra.html	220.00 %
psi.html	391.00 %
ghanoi3.html	443.00 %
gdist.html	458.00 %
shanoi.html	532.00 %
ghanoi4.html	565.00 %
ghanoi.html	638.00 %
star.html	728.00 %
rhanoi3.html	1616.00 %
lhanoi3.html	2627.00 %
shanoi4.html	7572.00 %
rhanoi4.html	8158.00 %

🎉

@cpitclaudel
Copy link
Owner

Woohoo! :) Thanks a lot for testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants