New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add static file options: hard/symlink & only-when-modified #1982

Merged
merged 1 commit into from Jun 22, 2017

Conversation

Projects
None yet
6 participants
@adeverteuil
Contributor

adeverteuil commented Aug 6, 2016

Hello,

I am interested in improving StaticGenerator. Please let me know if a similar effort is already under way, and also bring your recommendations and ideas. I don't have a timeline but I'd really like to have at least a basic working feature in about 2 weeks given the free time that I have.

The problem is when the site content has several large static files, like videos. When generating the site, all the static files are copied, regardless of modification time. Even WRITE_SELECTED doesn't prevent that. For example, the pages and article take 4 seconds to process, but the static files add another almost 30 seconds to it.

I'm thinking of adding a setting STATIC_CHECK_MODIFIED, false by default (the current behaviour). I'm currently working on inserting the logic in the StaticGenerator.generate_context() method.

Once it works well with mtimes, then maybe I can add a STATIC_CHECK_MODIFIED_METHOD settings which would be "mtime" by default, or the name of a hashlib function. But I doubt that computing hashes on two video files (source_path and save_as) would be faster than just copying one over the other.

Then I also want to try symlinks or hardlinks from save_as to source_path. I imagine that would be fast and even space efficient. Do you think of any problems that could arise if the output files are links and not copies?

I also think #1980 is a great idea which could also solve the problem and I might look into it.

Sincerely,
Alexandre de Verteuil

@pxquim

This comment has been minimized.

Show comment
Hide comment
@pxquim

pxquim Jul 26, 2016

Actually, comparing files may be quite faster than just copying over.

  • On SSDs, reading is usually faster than writing, and does not wear the drive cells, which can withstand a limited number of write cycles.
  • On larger hard disks, writing uses larger heads than reading, meaning that writes must actually read, update, and write back.

I considered modifying the output method on articles and pages to only write changed files. I can no longer remember if I dropped the idea, or if it proved to be too much work.

pxquim commented Jul 26, 2016

Actually, comparing files may be quite faster than just copying over.

  • On SSDs, reading is usually faster than writing, and does not wear the drive cells, which can withstand a limited number of write cycles.
  • On larger hard disks, writing uses larger heads than reading, meaning that writes must actually read, update, and write back.

I considered modifying the output method on articles and pages to only write changed files. I can no longer remember if I dropped the idea, or if it proved to be too much work.

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Jul 28, 2016

Contributor

Thanks @pxquim, this is very relevant. It is worth testing then.

I have a working feature for a STATIC_CHECK_MODIFIED setting if someone is interested in looking. I will continue to improve on this branch (and update the documentation) before issuing a pull request.

https://github.com/adeverteuil/pelican/tree/feature_check_static_modified

In StaticGenerator.generate_output(), I changed a call to shutil.copy2() for shutil.copy(). copy2() does a copy() followed by copystat() but with [microsecond resolution timestamps]. Since stating a file gives nanosecond timestamps, my source_is_newer() method failed. Was there a requirement to preserve source file attributes in generate_output()?

Contributor

adeverteuil commented Jul 28, 2016

Thanks @pxquim, this is very relevant. It is worth testing then.

I have a working feature for a STATIC_CHECK_MODIFIED setting if someone is interested in looking. I will continue to improve on this branch (and update the documentation) before issuing a pull request.

https://github.com/adeverteuil/pelican/tree/feature_check_static_modified

In StaticGenerator.generate_output(), I changed a call to shutil.copy2() for shutil.copy(). copy2() does a copy() followed by copystat() but with [microsecond resolution timestamps]. Since stating a file gives nanosecond timestamps, my source_is_newer() method failed. Was there a requirement to preserve source file attributes in generate_output()?

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Jul 28, 2016

Contributor

I have reached a point where features work and it's worth taking a break and testing the code in the field for a few days/weeks. Here's the documentation about the features I added:

STATIC_CREATE_LINKS = False

If the content and output directories are on the same device, then create hard links instead of copying files, which is much faster.

And by faster I mean almost instant.

STATIC_CHECK_MODIFIED = False

If set to True, and STATIC_CREATE_LINKS is False, compare mtimes of content and output files, and only copy content files that are newer than existing output files.

This also is a remarkable improvement in speed. In both cases the default setting does not change the previous behavior.

Possible future improvements include using the existing CHECK_MODIFIED_METHOD setting and creating symbolic links if hard links are not possible (content and output on different devices) but I'm already satisfied with what I have and it's in a consistent and working state now. I'd rather add features only if someone has interest in that.

I did my best to write clean code, with TDD and I wrote documentation. I ran tox and it passed for Python 2.7 and 3.4. Those of you who run the test suite often, do you install all of those Pythons on your system? Do you use your package manager, or compile from source, or use multiple computers/VMs?

Contributor

adeverteuil commented Jul 28, 2016

I have reached a point where features work and it's worth taking a break and testing the code in the field for a few days/weeks. Here's the documentation about the features I added:

STATIC_CREATE_LINKS = False

If the content and output directories are on the same device, then create hard links instead of copying files, which is much faster.

And by faster I mean almost instant.

STATIC_CHECK_MODIFIED = False

If set to True, and STATIC_CREATE_LINKS is False, compare mtimes of content and output files, and only copy content files that are newer than existing output files.

This also is a remarkable improvement in speed. In both cases the default setting does not change the previous behavior.

Possible future improvements include using the existing CHECK_MODIFIED_METHOD setting and creating symbolic links if hard links are not possible (content and output on different devices) but I'm already satisfied with what I have and it's in a consistent and working state now. I'd rather add features only if someone has interest in that.

I did my best to write clean code, with TDD and I wrote documentation. I ran tox and it passed for Python 2.7 and 3.4. Those of you who run the test suite often, do you install all of those Pythons on your system? Do you use your package manager, or compile from source, or use multiple computers/VMs?

@leotrs

This comment has been minimized.

Show comment
Hide comment
@leotrs

leotrs Aug 2, 2016

Really interesting, would like to see this merged.

leotrs commented Aug 2, 2016

Really interesting, would like to see this merged.

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Aug 6, 2016

Contributor

I made a few improvements today. Also the documentation for STATIC_CREATE_LINKS becomes:

Create links instead of copying files. If the content and output directories are on the same device, then create hard links. Falls back to symbolic links if the output directory is on a different filesystem. If symlinks are created, don’t forget to add the -L or --copy-links option to rsync when uploading your site.

I'll squash my commits and issue a pull request shortly. Thank you for your interest @leotrs.

Contributor

adeverteuil commented Aug 6, 2016

I made a few improvements today. Also the documentation for STATIC_CREATE_LINKS becomes:

Create links instead of copying files. If the content and output directories are on the same device, then create hard links. Falls back to symbolic links if the output directory is on a different filesystem. If symlinks are created, don’t forget to add the -L or --copy-links option to rsync when uploading your site.

I'll squash my commits and issue a pull request shortly. Thank you for your interest @leotrs.

adeverteuil added a commit to adeverteuil/pelican that referenced this pull request Aug 6, 2016

Add two STATIC_ settings. Fix #1982
STATIC_CREATE_LINKS = False

Create links instead of copying files. If the content and output
directories are on the same device, then create hard links. Falls
back to symbolic links if the output directory is on a different
filesystem. If symlinks are created, don’t forget to add the -L or
--copy-links option to rsync when uploading your site.

STATIC_CHECK_MODIFIED = False

If set to True, and STATIC_CREATE_LINKS is False, compare mtimes of
content and output files, and only copy content files that are newer
than existing output files.
@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Aug 6, 2016

Contributor

Hm, I think something is not right in the continuous integration check for Pythons versions 3.3 and 3.4. They are actually run in Python 2.7, as you can see in the Details above. In the failing jobs logs, you can also see the python version used in the test's virtualenv is 2.7:

$ python --version
Python 2.7.9

Please let me know if there is something I can do.

Contributor

adeverteuil commented Aug 6, 2016

Hm, I think something is not right in the continuous integration check for Pythons versions 3.3 and 3.4. They are actually run in Python 2.7, as you can see in the Details above. In the failing jobs logs, you can also see the python version used in the test's virtualenv is 2.7:

$ python --version
Python 2.7.9

Please let me know if there is something I can do.

@avaris

This comment has been minimized.

Show comment
Hide comment
@avaris

avaris Aug 6, 2016

Member

No, that's the python used to create virtualenvs.

py33 runtests: commands[0] | /home/travis/build/getpelican/pelican/.tox/py33/bin/python --version
Python 3.3.5
Member

avaris commented Aug 6, 2016

No, that's the python used to create virtualenvs.

py33 runtests: commands[0] | /home/travis/build/getpelican/pelican/.tox/py33/bin/python --version
Python 3.3.5
@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Aug 11, 2016

Contributor

I learned how to use pyenv and installed the supported Python versions on my computer, then fixed the failing unit test. The rest of this comment is just for curiosity.

I hit a change of behavior between Python 3.4 and 3.5. I didn't find anything about that in the Python changelogs though, but it looks like a bugfix since the documentation of codecs.open() says the "r" mode is the default, so it should work the same with or without it explicitly specified. Did anyone know about this discrepency? Here's a small script to reproduce it, and the output.

# unexplained.py
import codecs
f = open("f", "wt")
f.write("content")
f.close()

f = codecs.open("f")
print(type(f))
f.close()

f = codecs.open("f", "r")
print(type(f))
f.close()
alex@meson:pelican$ for v in 3 4 5; do python3.$v < unexplained.py; done                                                                                     
<class '_io.BufferedReader'>
<class '_io.TextIOWrapper'>
<class '_io.BufferedReader'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>     # <--- This is not like the others
<class '_io.TextIOWrapper'>
alex@meson:pelican$ pyenv versions
  system
* 3.3.5 (set by /home/alex/.pyenv/version)
  3.3.6
* 3.4.2 (set by /home/alex/.pyenv/version)
  3.4.5
* 3.5.0 (set by /home/alex/.pyenv/version)

In Python 3.5, codec.open() without the "r" parameter returns a TextIOWrapper. To fix the problem I added the "r" parameter so the same class of reader is returned in all 3 versions of Python.

Version 3.4.5 works the same as 3.5.0, but 3.3.6 still returns BufferedReader without "r".

Contributor

adeverteuil commented Aug 11, 2016

I learned how to use pyenv and installed the supported Python versions on my computer, then fixed the failing unit test. The rest of this comment is just for curiosity.

I hit a change of behavior between Python 3.4 and 3.5. I didn't find anything about that in the Python changelogs though, but it looks like a bugfix since the documentation of codecs.open() says the "r" mode is the default, so it should work the same with or without it explicitly specified. Did anyone know about this discrepency? Here's a small script to reproduce it, and the output.

# unexplained.py
import codecs
f = open("f", "wt")
f.write("content")
f.close()

f = codecs.open("f")
print(type(f))
f.close()

f = codecs.open("f", "r")
print(type(f))
f.close()
alex@meson:pelican$ for v in 3 4 5; do python3.$v < unexplained.py; done                                                                                     
<class '_io.BufferedReader'>
<class '_io.TextIOWrapper'>
<class '_io.BufferedReader'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>     # <--- This is not like the others
<class '_io.TextIOWrapper'>
alex@meson:pelican$ pyenv versions
  system
* 3.3.5 (set by /home/alex/.pyenv/version)
  3.3.6
* 3.4.2 (set by /home/alex/.pyenv/version)
  3.4.5
* 3.5.0 (set by /home/alex/.pyenv/version)

In Python 3.5, codec.open() without the "r" parameter returns a TextIOWrapper. To fix the problem I added the "r" parameter so the same class of reader is returned in all 3 versions of Python.

Version 3.4.5 works the same as 3.5.0, but 3.3.6 still returns BufferedReader without "r".

@justinmayer

This comment has been minimized.

Show comment
Hide comment
@justinmayer

justinmayer Sep 27, 2016

Member

Any @getpelican/reviewers have a moment to review this pull request?

Member

justinmayer commented Sep 27, 2016

Any @getpelican/reviewers have a moment to review this pull request?

@almet

This looks really good to me, thanks! I've added a few comments here and there about the naming mainly, I think we should be able to merge as soon as they're fixed.

Show outdated Hide outdated docs/settings.rst
Show outdated Hide outdated docs/settings.rst
Show outdated Hide outdated docs/settings.rst
Show outdated Hide outdated docs/settings.rst
Show outdated Hide outdated pelican/generators.py
Show outdated Hide outdated pelican/generators.py
@almet

This comment has been minimized.

Show comment
Hide comment
@almet

almet Oct 5, 2016

Member

Okay cool, we're almost there!

Member

almet commented Oct 5, 2016

Okay cool, we're almost there!

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Nov 20, 2016

Contributor

I had to merge upstream/master into the feature branch to make checks pass again and to resolve conflicts. Should I rebase or squash or is it okay as-is?

Contributor

adeverteuil commented Nov 20, 2016

I had to merge upstream/master into the feature branch to make checks pass again and to resolve conflicts. Should I rebase or squash or is it okay as-is?

@justinmayer

This comment has been minimized.

Show comment
Hide comment
@justinmayer

justinmayer Nov 21, 2016

Member

Once code reviews are marked as approved, we usually request that you rebase & squash your commits.

@almet: Any follow-up comments on this pull request?

Member

justinmayer commented Nov 21, 2016

Once code reviews are marked as approved, we usually request that you rebase & squash your commits.

@almet: Any follow-up comments on this pull request?

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Nov 21, 2016

Contributor

Ok, I understand how to rebase and squash my commits. Is it okay to force push the changes on Github after doing so? Can or should I assume nobody cloned that feature branch from my fork?

I'm 90% sure I understand what I have to do but this is my first contribution to a public project and it goes against everything I read about changing version history once it has been pushed to a public repo.

Contributor

adeverteuil commented Nov 21, 2016

Ok, I understand how to rebase and squash my commits. Is it okay to force push the changes on Github after doing so? Can or should I assume nobody cloned that feature branch from my fork?

I'm 90% sure I understand what I have to do but this is my first contribution to a public project and it goes against everything I read about changing version history once it has been pushed to a public repo.

@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Dec 19, 2016

Contributor

Hi, any comments about the changes? There's no rush, but I'm looking forward to some free time during Christmas holidays, so it would be good timing to finish this in the coming weeks. Please also take a moment to address my question in the last comment.

I hope you are well. Thank you for your time and such a great project. Still use it for my blog and it's fun.

Contributor

adeverteuil commented Dec 19, 2016

Hi, any comments about the changes? There's no rush, but I'm looking forward to some free time during Christmas holidays, so it would be good timing to finish this in the coming weeks. Please also take a moment to address my question in the last comment.

I hope you are well. Thank you for your time and such a great project. Still use it for my blog and it's fun.

@almet

almet approved these changes Dec 20, 2016

@almet

This comment has been minimized.

Show comment
Hide comment
@almet

almet Dec 20, 2016

Member

Everything seems good here! Squashing is the way to go, don't be afraid (especially because you're in a branch). This is done in order to keep the history easier to read.

Thanks for your time and additions :)

Member

almet commented Dec 20, 2016

Everything seems good here! Squashing is the way to go, don't be afraid (especially because you're in a branch). This is done in order to keep the history easier to read.

Thanks for your time and additions :)

@justinmayer

This comment has been minimized.

Show comment
Hide comment
@justinmayer

justinmayer Dec 26, 2016

Member

Hi Alexandre. As Alexis suggested, please rebase & squash your commits. Thanks again for all your work on this! (^_^)

Member

justinmayer commented Dec 26, 2016

Hi Alexandre. As Alexis suggested, please rebase & squash your commits. Thanks again for all your work on this! (^_^)

Add two STATIC_ settings. Fix #1982
STATIC_CREATE_LINKS = False

Create links instead of copying files. If the content and output
directories are on the same device, then create hard links. Falls
back to symbolic links if the output directory is on a different
filesystem. If symlinks are created, don’t forget to add the -L or
--copy-links option to rsync when uploading your site.

STATIC_CHECK_IF_MODIFIED = False

If set to True, and STATIC_CREATE_LINKS is False, compare mtimes of
content and output files, and only copy content files that are newer
than existing output files.
@adeverteuil

This comment has been minimized.

Show comment
Hide comment
@adeverteuil

adeverteuil Dec 30, 2016

Contributor

Hi @justinmayer, @almet, here you go: rebased and squashed properly. I hope the users will like it!

Contributor

adeverteuil commented Dec 30, 2016

Hi @justinmayer, @almet, here you go: rebased and squashed properly. I hope the users will like it!

@justinmayer

This comment has been minimized.

Show comment
Hide comment
@justinmayer

justinmayer Jan 3, 2017

Member

Many thanks, Alexandre! Once we ship 3.7.1 (a minor bug-fix release), I'll add this to the 3.8 milestone and merge.

Member

justinmayer commented Jan 3, 2017

Many thanks, Alexandre! Once we ship 3.7.1 (a minor bug-fix release), I'll add this to the 3.8 milestone and merge.

@justinmayer justinmayer added this to the 3.8.0 milestone Apr 18, 2017

@justinmayer

This comment has been minimized.

Show comment
Hide comment
@justinmayer

justinmayer Jun 22, 2017

Member

Many thanks to @adeverteuil for the excellent contribution and to everyone who helped review it. (^_^)

Member

justinmayer commented Jun 22, 2017

Many thanks to @adeverteuil for the excellent contribution and to everyone who helped review it. (^_^)

@justinmayer justinmayer changed the title from Improvements to static files generator to Add static file options: hard/symlink & only-when-modified Jun 22, 2017

@justinmayer justinmayer merged commit d4435ea into getpelican:master Jun 22, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment