
signal when a post is about to be compiled OR query which posts need rebuild #3293

Closed
Xeverous opened this issue Aug 19, 2019 · 14 comments
@Xeverous

I need to perform various actions per post (check whether metadata is correct, generate breadcrumbs, generate prev/next buttons, etc.) that can read and alter its metadata (or possibly other attributes).

I can do this after the scanned signal, but that signal is fired once, after scanning, not for each (re)built post. That would mean running the same code for every page/post every time files are scanned - a lot of redundant operations, even in an incremental build, because AFAIK there is no public API to check whether a post is out of date (i.e. needs a rebuild).

I want to be able to hook a plugin that will be called every time a post is about to be compiled. This would allow similar functionality to shortcodes, but instead of working on data provided by an extension inside the post the plugin would work on the data provided by the post object itself.

If there is (or could be) an API that shows which posts/pages are new or out of date and need a rebuild - that would be great. Looking at the sitemap plugin implementation, it seems to use os.walk and detect changes in the repository manually, because there is no way to tell Nikola that some task needs to run "every time a page/post is added/removed".

@Kwpolska
Member

You might be able to get this information straight from the source, i.e. from doit.

@Xeverous
Author

When exactly? Nikola does not document the order of all build steps anywhere. I'm not sure when doit files are generated or updated.

@ralsina
Member

ralsina commented Aug 20, 2019

Basically Nikola doesn't determine the order of build steps.

Nikola creates a dependency graph, and that graph is passed to doit, which then resolves it and plans how to execute all the tasks. Since it also supports parallelization, there isn't really a task order per se: the order of some tasks is indifferent and may change from run to run.
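To make the dependency-graph idea concrete, here is a minimal sketch of the kind of task dict a doit-based tool like Nikola emits. The file names and the action are hypothetical; real tasks come from plugins such as render_posts.

```python
# Sketch of a doit-style task dict, the unit that ends up in the
# dependency graph. All paths here are made up for illustration.
def task_render_post():
    return {
        "name": "cache/posts/example.html",
        "file_dep": ["posts/example.md"],        # rebuild when the source changes
        "targets": ["cache/posts/example.html"],  # what this task produces
        "actions": [(print, ["compiling example.md"])],  # real action compiles the post
    }
```

doit walks all such dicts, wires `file_dep`/`targets` into a graph, and only runs tasks whose dependencies changed.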

In general you can't know when a "post needs a rebuild" because in some cases it may be triggered by other reasons. For example, if you add a new post, the previously-newest will be rebuilt, since before it had no "next post" and now it does. Same for all tags matching the new post, and so on.

The good news (I think) is that this is not really all that important. If checking metadata is relatively fast, you could just do it as often as you want. If a post is modified in a way that Nikola "sees" then in the next pass it will be rebuilt. If it's not, it won't.

So, I would do this independently. If you want you can do it as a nikola plugin anyway, just make it a command plugin, and then do something like

nikola check_metadata && nikola build

I have done some plugins that work this way to generate posts out of other sources, such as continuous_import (it imports my goodreads feed) and planetoid (a sort of "RSS planet" implementation)
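A sketch of the validation such a hypothetical check_metadata command might run before `nikola build`. The set of required keys is this example's own assumption; adjust it for your site.

```python
# Minimal metadata check that could live inside a Nikola command plugin.
# REQUIRED_KEYS is an assumption, not something Nikola mandates.
REQUIRED_KEYS = {"title", "slug", "date"}

def check_metadata(meta):
    """Return the set of required keys missing from one post's metadata."""
    return REQUIRED_KEYS - set(meta)

def check_all(posts):
    """Map post source path -> missing keys, only for posts with problems."""
    problems = {}
    for path, meta in posts.items():
        missing = check_metadata(meta)
        if missing:
            problems[path] = missing
    return problems
```

Run it as its own step (`nikola check_metadata && nikola build`); if nothing was rewritten on disk, the subsequent build stays incremental.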

If checking metadata is expensive, OTOH, maybe you can do it for things that have been modified since the last check:

Create a task plugin that depends on the post file and generates the post file itself (with the possibly updated metadata). I am not 100% sure this would actually work, but it may be worth trying.

Alternatively, you can get information about which tasks are and are not up-to-date from doit's database. No idea how, but doit does it :-)

$ nikola list                                                                                                                                   
Scanning posts........done!
copy_assets         Copy theme assets into output.
copy_files          Copy static files into the output folder.
create_bundles      Bundle assets.

[other tasks]

$ nikola info

render_posts

Build HTML fragments from metadata and text.

status     : run
 * The task has no dependencies.

task_dep   : 
 - render_galleries
 - render_posts:timeline_changes
 - render_posts:cache/posts/episodio-5-muchos-pythons.html
 - render_posts:cache/posts/macri-en-patineta-en-un-puente.html
 - render_posts:cache/posts/episodio-4-cuidate-nene.html

[many many more subtasks]

$ nikola info  render_posts:cache/posts/episodio-5-muchos-pythons.html

render_posts:cache/posts/episodio-5-muchos-pythons.html

status     : up-to-date

file_dep   : 
 - posts/episodio-5-muchos-pythons.md

task_dep   : 
 - render_posts:timeline_changes
 - render_galleries

targets    : 
 - cache/posts/episodio-5-muchos-pythons.html
 - cache/posts/episodio-5-muchos-pythons.html.dep

If you use the --backend=json option for nikola, .doit.db is easier to read.

So, using jq to query it I see something like this:

  "render_posts:cache/pages/path_handlers.html": {
    "_values_:": {
      "_config_changed:nikola.plugins.task.posts": "0bd67ec41150a72dfc2283cf8d1f08de",
      "_config_changed:nikola.post.Post.deps_uptodate:compiler:pages/path_handlers.rst": "4cf4e27a5f20d721bf5496e1cc0c67c8"
    },
    "checker:": "MD5Checker",
    "pages/path_handlers.rst": [
      1525563906.009334,
      4878,
      "6aebfa666e519b982788ba79d9f57d83"
    ],
    "deps:": [
      "pages/path_handlers.rst"
    ]
  },

That "6aeb..." there is the MD5 of pages/path_handlers.rst the last time nikola ran. So, it should be possible to know if a file is "dirty" totally outside Nikola as well, but it looks complicated and time consuming.
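Building on the record layout shown above (dependency path mapped to `[mtime, size, md5]`), here is a sketch of a dirty-file check done entirely outside Nikola. It assumes you have already loaded the JSON-backend `.doit.db` into a dict; the task name is whatever key you query.

```python
import hashlib
import os

def file_md5(path):
    """MD5 of a file's current contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def dirty_files(doit_db, task_name):
    """Return the dependency files of one task whose current MD5 no longer
    matches the hash stored in the (already-parsed) .doit.db record.
    Mirrors the layout above: dep path -> [mtime, size, md5]."""
    record = doit_db.get(task_name, {})
    dirty = []
    for dep in record.get("deps:", []):
        stored = record.get(dep)
        if not stored or not os.path.exists(dep) or file_md5(dep) != stored[2]:
            dirty.append(dep)
    return dirty
```

This is only a reading of the database format as shown in the jq output; doit's internal checker classes are the authoritative implementation.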

@Xeverous
Author

Basically Nikola doesn't determine the order of build steps.

Nikola creates a dependency graph, and that graph is passed to doit, which then resolves it and plans how to execute all the tasks. Since it also supports parallelization, there isn't really a task order per se: the order of some tasks is indifferent and may change from run to run.

This. This really cleared a lot up for me. Every other static site generator I have worked with had a fixed build order, and any parallelization was at most per task within the same build step. I was surprised that Nikola has no "build order and dependency order" documentation, but now I get it.

In general you can't know when a "post needs a rebuild" because in some cases it may be triggered by other reasons. For example, if you add a new post, the previously-newest will be rebuilt, since before it had no "next post" and now it does. Same for all tags matching the new post, and so on.

If a post is modified in a way that Nikola "sees" then in the next pass it will be rebuilt. If it's not, it won't.

Is there any API that allows marking certain files as out of date? I know that shortcode plugins can return a list of file dependencies, which will trigger a rebuild if some shortcode dependency is newer. You wrote that the previously-newest post will be rebuilt because it now has a next post - how is that detected? Does Nikola simply check file write dates and search for metadata changes?

I was wondering how post-list or sitemap spec their dependencies, but apparently they do not. I have seen that os.walk is used inside the sitemap plugin implementation, which means there is no way to outdate a task based on a new post/page appearing. The sitemap plugin has its own cache and scans files outside Nikola. I have a similar problem: I want an index page that needs to be rebuilt every time a new post/page is added.

@ralsina
Member

ralsina commented Aug 20, 2019

Is there any API that allows marking certain files as out of date?

Good question! AFAIK no. There are APIs to allow you to mark a specific task or output file as needing to be redone.

For example, all of these do nothing because everything is up to date:

$ nikola build                                                                                                                             
Scanning posts........done!
$ nikola build output/index.html                                                                                                             
Scanning posts........done!

But using -a forces a build of all that's needed for whatever I ask for:

$ nikola build -a output/index.html
Scanning posts........done!
.  render_posts:timeline_changes
.  render_posts:cache/posts/1.html
.  render_taxonomies:output/index.html

You wrote that the previously-newest post will be rebuilt because it now has a next post - how is that detected?

I would need to check the code... in the generic_page_renderer function there is this code:

        deps_dict = {}
        if post.prev_post:
            deps_dict['PREV_LINK'] = [post.prev_post.permalink(lang)]
        if post.next_post:
            deps_dict['NEXT_LINK'] = [post.next_post.permalink(lang)]

That dictionary basically ends up as a dependency of the task. If its contents change, the task is out of date. So, when a post gets a next_post, it changes, and is dirty.
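The effect of "the dict's contents are part of the task's signature" can be sketched like this. The exact serialization doit uses internally may differ; the point is only that adding a NEXT_LINK entry changes the digest, which marks the task dirty.

```python
import hashlib
import json

def signature(deps_dict):
    """Illustrative stand-in for doit's config-based uptodate check:
    any change to the dict's contents yields a different digest."""
    blob = json.dumps(deps_dict, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()
```

So when a new post is added, the previously-newest post's dict gains a NEXT_LINK key, its signature changes, and the render task for that post reruns.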

Does Nikola simply check file write dates and search for metadata changes?

Some "metadata" is calculated, like post.prev_post and post.next_post and those are handled like mentioned before. In general, metadata is extracted in different ways while posts are scanned, and then it's all considered as part of a post's "signature", so to speak.

This is AFAICS, unconditional and always done.

The mechanism to know if a file is "dirty" is inherited from doit. It can use timestamps or hashes of file contents (IIRC the default is content hashes)

nikola build --help shows this:

  --check_file_uptodate=ARG  Choose how to check if files have been modified.
                            Available options [default: md5]:
                              'md5': use the md5sum
                              'timestamp': use the timestamp

@Xeverous
Author

Some "metadata" is calculated, like post.prev_post and post.next_post and those are handled like mentioned before. In general, metadata is extracted in different ways while posts are scanned, and then it's all considered as part of a post's "signature", so to speak.

Do you have any recommendation how should I implement per-post plugins then?

I need to add various extra info to the post object (possibly to its metadata), but I want to do it from the plugin side. This information is then used inside HTML templates. I could provide all of this info manually inside posts (e.g. as manually written shortcodes or metadata), but that is both unwanted manual work and error-prone.

I can register all the work on the scanned signal and just perform everything for every post, but I'm afraid that would invalidate all post objects and cause everything to be rebuilt. Looking at the documentation, the only way to pass some data to HTML templates is through the Post object.

@ralsina
Member

ralsina commented Aug 20, 2019

I am not sure I understand the use case.

So, you want to modify post's metadata. Ok. You can do that. For example the update_metadata plugin does it.

But you want to modify it when what happens?

Is it "once a day"? Is it "every time someone modifies this database"? The trigger for the modification is important, I think.

Just modify whatever posts need modifying, and then "nikola build" will build exactly as much as needed.

@Xeverous
Author

Xeverous commented Aug 20, 2019

The trigger for the modification is important, I think.

Indeed. The worst thing would be to introduce a circular dependency or something that always triggers a full rebuild.

Just modify whatever posts need modifying, and then "nikola build" will build exactly as much as needed.

I'm wondering whether Nikola will detect everything correctly

  • On one side, there is a risk that my plugin will change a post's metadata but the post will not be rebuilt.
  • On the other side, if I update/recheck everything after all post objects have been scanned, it might accidentally cause a rebuild of all posts.

But you want to modify it when what happens?

My current triggers:

  • my own shortcodes: this is easy, because shortcodes return HTML data and a list of file dependencies. My custom shortcodes can use various extra files in the repository, and rewrites of those files will be noticed.
  • index generation shortcode: any page that uses it needs to be rebuilt every time a new page is added or something changes its path. Sitemap does this with os.walk plus its own cache entries, because there is no way to register "every time a new file appears", only "every time file X is modified".
  • breadcrumb generation: as above, but limited to parent directory pages, not all pages
  • prev/next buttons generation: the title/path of the previous or next article changes

In short, various posts need to provide extra information for templates. As far as I am aware, the only way to pass extra information to the templates is through adding more attributes to the post object.

The bad thing is that AFAIK metadata is reread every time a post is scanned, and there is no other way to permanently add something to the post metadata besides writing it manually in the file itself. I'm struggling to find any way of safely automating that "addition of extra metadata", because while I could do it manually in every post, I want to script it and generate everything from code, given site.timeline. That would be much less work and would prevent mistakes.

@ralsina
Member

ralsina commented Aug 20, 2019

I'm wondering whether Nikola will detect everything correctly

It should.

On one side, there is a risk that my plugin will change a post's metadata but the post will not be rebuilt.

If the metadata is changed on disk, it will be detected. Worst-case scenario, it will change the hash of the post's file, so it will be reread the next time nikola runs.

On the other side, if I update/recheck everything after all post objects have been scanned, it might accidentally cause a rebuild of all posts.

Only if you do change something on all posts. In which case, well, you do want to rebuild everything, right?

index generation shortcode: any page that uses it needs to be rebuilt every time a new page is added or something changes its path. Sitemap does this with os.walk plus its own cache entries, because there is no way to register "every time a new file appears", only "every time file X is modified".

We do have a "magical" dependency for this: if you pretend to depend on a file called "####MAGIC####TIMELINE", you will cause a rebuild every time a new post or page is added.

It was added because the post_list plugin needed exactly this. Which makes me wonder if post_list is not what you need, really :-)
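A sketch of how a shortcode could use that magic dependency, following the shortcode contract mentioned earlier in the thread (handlers return HTML plus a list of dependencies). The handler body and the index format are this example's own invention; the `title()` call mirrors the method-style access Nikola's Post objects use.

```python
# The magic dependency Nikola treats specially: any task listing it is
# considered out of date whenever a post or page is added or removed.
MAGIC_TIMELINE_DEP = "####MAGIC####TIMELINE"

def index_shortcode(site=None, **kwargs):
    """Hypothetical index-generating shortcode handler.
    Returns (html, deps); the magic dep makes every page using this
    shortcode stale whenever the timeline changes."""
    titles = [p.title() for p in site.timeline] if site else []
    html = "<ul>" + "".join("<li>%s</li>" % t for t in titles) + "</ul>"
    return html, [MAGIC_TIMELINE_DEP]
```

This matches what post_list does: it never tracks individual files, it just declares the whole timeline as its dependency.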

breadcrumb generation: as above, but limited only to parent directory pages, not all pages

BTW, just noticed that all posts and pages should have a 'crumbs' variable in their context when you render them. I totally did not remember this.

I'm struggling to find any way of safely automating that "addition of extra metadata"

We do provide mechanisms to alter metadata safely! Check the upgrade_metadata plugin ... they may not be ideal, but we can always improve them.

Also, just noticed this one: https://plugins.getnikola.com/v8/pretty_breadcrumbs/ ... lets you create crumbs based on metadata instead of in the path of the generated file, which may be handy. I did not remember that one :-)

@Xeverous
Author

Only if you do change something on all posts. In which case, well, you do want to rebuild everything, right?

No, I do not want to rebuild the post if there was no change to the metadata.

The problem is this: a post (in its file form) never has crumb or prev/next fields specified. They are added by my plugins after all posts are scanned. So every time a post is reread, it has none of these extra attributes, and then my plugins add the fields again. Would this cause an always-rebuild?

I would like to store some data (associated with a specific post) permanently. But there is no such thing. At most the key-value cache (which would require some effort to serialize stuff into strings).

So right now every time a build ends, the metadata is lost.

If you pretend to depend on a file called "####MAGIC####TIMELINE", you will cause a rebuild every time a new post or page is added.
It was added because the post_list plugin needed exactly this. Which makes me wonder if post_list is not what you need, really :-)

Yes, I need this, and I was indeed confused when I saw it in post_list's implementation. It looked like some cache/filesystem hack, and apparently it is. Now I'm wondering whether it is a stable but undocumented feature.

@ralsina
Member

ralsina commented Aug 20, 2019

No, I do not want to rebuild the post if there was no change to the metadata.

If there is no change, the file will not change, and a rebuild will not be triggered. I really don't see the problem.

The problem is this: a post (in its file form) never has crumb or prev/next fields specified. They are added by my plugins after all posts are scanned. So every time a post is reread, it has none of these extra attributes, and then my plugins add the fields again. Would this cause an always-rebuild?

Just save those fields in the file. You can have all the metadata you want.

So right now every time a build ends, the metadata is lost.

Ah, so I suspect we are talking past each other. In nikola's context "post metadata" means things that are in the post header, like slug or title or whatever, and they are persistent.

If I used it improperly before to refer to prev_post or next_post that's my bad. Those are properties of the Post objects but not really metadata.

It looked like some cache/filesystem hack, and apparently it is. Now I'm wondering whether it is a stable but undocumented feature.

It has been working for a while, and we don't intend to change it. It's not very well documented because it was a very unusual corner case, really.

@Xeverous
Author

Ah, so I suspect we are talking past each other. In nikola's context "post metadata" means things that are in the post header, like slug or title or whatever, and they are persistent.

Just save those fields in the file. You can have all the metadata you want.

True, I can add any data in the post header, but I do not want to. I want to generate this data from plugins - I hate duplicated data, and if something can be generated (e.g. breadcrumbs based on post location), I want it generated, not written manually in the post itself.


I have checked the upgrade metadata plugin, and this is exactly what I want to avoid. That plugin reads files, changes metadata, and then saves the files. It's a good plugin for refactoring all files in the repository and then committing the changes, but it's not a good approach to customizing website generation - I want the extra information to be available in HTML templates. Specifying this extra info in the post header is unwanted because:

  • it generates additional git diffs
  • it duplicates data, because all of it can be inferred from other data
  • manually writing this data in the post file is error-prone
  • manually writing this data in the post file is extra work

For each post, I want to generate some post-specific HTML (e.g. a breadcrumb). Jinja/Mako cannot do this directly because it has no access to the site object - only the post object. So the post object needs to provide all the information. Then Jinja/Mako would just use {{ post.breadcrumb() }} or something similar.

So I want my plugins to modify the post object. Add some attributes to each post, that will be available in HTML templates. My plugins will read site.timeline, parse it and apply appropriate changes to every post object.

  • Any extra attributes added to the post object will be lost before the next scanning - am I right? Are post objects recreated by rereading files from the repository every time a new build is triggered? Are the Python Post objects saved anywhere?
  • Will adding any attributes to the Post object mark it as out of date?

@ralsina
Member

ralsina commented Aug 21, 2019

Then Jinja/Mako should just use {{ post.breadcrumb() }} or something similar.

Doing that is not a great problem; you can monkeypatch Post. But yes, making everything that uses that data be rebuilt when you change it is the problem we are having.
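The monkeypatching part is straightforward. A sketch, using a stand-in class so the example is self-contained; in a real plugin you would patch nikola.post.Post instead, and the breadcrumb logic here (splitting the permalink) is just one possible choice.

```python
class Post:
    """Stand-in for nikola.post.Post, enough to show the technique."""
    def __init__(self, permalink):
        self._permalink = permalink

    def permalink(self):
        return self._permalink

def breadcrumb(self):
    """Derive crumbs from the permalink's path components."""
    parts = [p for p in self.permalink().strip("/").split("/") if p]
    return " > ".join(parts)

# The monkeypatch: every Post instance now exposes post.breadcrumb(),
# callable from templates as {{ post.breadcrumb() }}.
Post.breadcrumb = breadcrumb
```

As noted in the reply, the patch itself is easy; the hard part is making pages that use the new method rebuild when its output changes, since the method is not part of any task's declared dependencies.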

Any extra attributes added to the post object will be lost before the next scanning - am I right? Are post objects recreated by rereading files from the repository every time a new build is triggered? Are the Python Post objects saved anywhere?

They are read from files. That's why I suggest you save the data you want to make persistent to those files.

Will adding any attributes to the Post object mark it as out of date?

I am not sure. There is a Post.__repr__ that tries to do that but I am not 100% sure we are using it. It would be a matter of trying it.


I am very sorry but this thread has taken a whole lot of the very little time I have for these things.
Good luck with it, I have given all the information I have.

@Xeverous
Author

They are read from files. That's why I suggest you save the data you want to make persistent to those files.

I really want to avoid cluttering my posts with redundant data (and producing git diffs filled with build artifacts). But I have an idea: I could make my own (git-ignored) cache, independent of Nikola, and just store any necessary info there.
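That side-cache idea can be sketched in a few lines. The file name, location, and JSON schema are this example's own choices; it just persists per-post data across builds without touching the post files.

```python
import json
import os

class PostCache:
    """A tiny git-ignored side cache keyed by post source path,
    independent of Nikola's own key-value cache."""

    def __init__(self, path=".post_cache.json"):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def get(self, post_path, default=None):
        return self.data.get(post_path, default)

    def set(self, post_path, value):
        self.data[post_path] = value

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.data, f, indent=2)
```

A plugin could load this on the scanned signal, attach the cached values to each Post object, and save it at the end of the build, keeping the generated data out of both the post headers and the git history.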

Looks like I will just experiment with monkeypatching Post objects and see what happens.

@ralsina ralsina closed this as completed Nov 20, 2019