
Incremental publishing #4050

Draft · wants to merge 6 commits into develop

Conversation

jelovirt
Member

@jelovirt commented Nov 14, 2022

Description

Incremental publishing: cache preprocessing results so that unchanged files can be reused on later runs.

Motivation and Context

How Has This Been Tested?

Type of Changes

  • New feature (non-breaking change which adds functionality)

Documentation and Compatibility

  • What documentation changes are needed for this feature?
    (Provide links to existing documentation topics that will require updates)
  • Will this change affect backwards compatibility or other users' overrides?

Checklist

Signed-off-by: Jarno Elovirta <jarno@elovirta.com>
Add directed graph using adjacency matrix
@jelovirt jelovirt added the feature New feature or request label Nov 14, 2022
@jelovirt jelovirt self-assigned this Nov 14, 2022
Signed-off-by: Jarno Elovirta <jarno@elovirta.com>
@chrispy-snps
Contributor

@jelovirt - will this use some kind of SHA of the input files to determine only what needs to be reprocessed? (I have been considering doing something like this in our builds to detect what deliverables must be rebuilt when writers modify the DITA content.)
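To illustrate what I mean by hashing the input files, here is a minimal sketch using SHA-256 from java.security; whether this PR would use that algorithm or API at all is exactly the question above:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class FileHasher {
        // Returns the SHA-256 digest of a file's contents as a lowercase hex string.
        static String sha256(Path file) throws IOException {
            try (InputStream in = Files.newInputStream(file)) {
                MessageDigest digest = MessageDigest.getInstance("SHA-256");
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : digest.digest()) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("SHA-256 is available on every JVM", e);
            }
        }
    }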

@kirkilj

kirkilj commented Nov 15, 2022

I was having SHA inklings about this as well.

@jelovirt
Member Author

jelovirt commented Nov 15, 2022

An incremental list that tracks the implementation; this is not intended to be a detailed waterfall spec of how the feature will work.

  1. pass --cache=path/dir
  2. read path/dir/.job.xml to get file infos and the dependency graph
  3. start processing from input document:
    1. check if file has changed (see the sketch after this list):
      1. if file system timestamps in source and cache do not match, mark as changed
      2. otherwise if file contents hash of source and cache do not match, mark as changed
      3. otherwise mark as unchanged
    2. if file has not changed, copy file from cache and mark file as fully processed
    3. else, continue processing normally
  4. run rest of preprocessing
  5. copy processed content from temp directory and .job.xml to cache directory for the next processing round
  6. convert every changed file to HTML
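Rough sketch only of the change check in step 3: names like CacheEntry and isChanged are illustrative, not classes in this PR, and sha256 refers to the hashing helper sketched in the comment above.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical record of what the previous run stored for one file:
    // the file system timestamp and the content hash kept in the cache.
    final class CacheEntry {
        final long lastModified;
        final String contentHash;
        CacheEntry(long lastModified, String contentHash) {
            this.lastModified = lastModified;
            this.contentHash = contentHash;
        }
    }

    final class ChangeDetector {
        // Mirrors the three checks in step 3: timestamp mismatch -> changed,
        // otherwise hash mismatch -> changed, otherwise unchanged.
        static boolean isChanged(Path source, CacheEntry cached) throws IOException {
            long sourceTimestamp = Files.getLastModifiedTime(source).toMillis();
            if (sourceTimestamp != cached.lastModified) {
                return true;
            }
            return !FileHasher.sha256(source).equals(cached.contentHash);
        }
    }

If isChanged returns false, the file is copied from the cache and marked as fully processed; otherwise it goes through preprocessing normally.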

@raducoravu
Member

raducoravu commented Nov 15, 2022

Also, if at least one map is found to be modified, all incremental caches for all resources are turned off.

@raducoravu
Member

For step (6), are you considering some kind of ditafileset which enumerates only the unchanged files from the temp folder?
For our WebHelp output, since we need to generate breadcrumbs inside each topic, we will probably not benefit from incremental publishing caching in this step, but we would benefit in the previous steps.

@raducoravu
Member

And if a topic is modified, maybe its shortdesc and title have changed, and it is also referenced in a DITA map, so does this mean the DITA map needs to be re-processed as well, along with all other topics linking to the file?
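Something like the following is what I imagine the "directed graph" commit is for. This is only a sketch, not the PR's code, and it uses an adjacency list for brevity rather than the adjacency matrix named in the commit; edges point from a referenced file to the files that reference it, so a changed topic pulls in the map and any topics linking to it.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class DependencyGraph {
        // referenced file -> files that reference it (topic -> maps and linking topics)
        private final Map<String, Set<String>> dependents = new HashMap<>();

        void addReference(String referenced, String referencing) {
            dependents.computeIfAbsent(referenced, k -> new HashSet<>()).add(referencing);
        }

        // Returns the changed files plus everything that transitively references them.
        Set<String> filesToReprocess(Set<String> changed) {
            Set<String> dirty = new HashSet<>(changed);
            Deque<String> queue = new ArrayDeque<>(changed);
            while (!queue.isEmpty()) {
                String file = queue.pop();
                for (String dependent : dependents.getOrDefault(file, Set.of())) {
                    if (dirty.add(dependent)) {
                        queue.push(dependent);
                    }
                }
            }
            return dirty;
        }
    }

For example, after addReference("topic.dita", "map.ditamap") and addReference("topic.dita", "other.dita"), filesToReprocess(Set.of("topic.dita")) returns all three files, i.e. the map and the linking topic would be re-processed too.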

Signed-off-by: Jarno Elovirta <jarno@elovirta.com>
@chrispy-snps
Contributor

chrispy-snps commented Dec 25, 2022

I wonder if this approach could be scaled up to implement deliverable-level caching, as described in the last approach here:

dita-users.groups.io > some thoughts on how a DITA/Git CI/CD pipeline could work

For example, maybe DITA-OT could store deliverable checksums (perhaps by deliverable ID); then it would simply be a matter of running

dita --project project.xml --cache /path/to/cache

to efficiently update all deliverables, which would be extremely cool.
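What I have in mind is roughly this hypothetical per-deliverable cache, keyed by deliverable ID from the project file (all names made up):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical deliverable-level cache: deliverable id -> checksum of its inputs.
    final class DeliverableCache {
        private final Map<String, String> checksums = new HashMap<>();

        // True if the deliverable's inputs changed since the last recorded build.
        boolean needsRebuild(String deliverableId, String currentChecksum) {
            return !currentChecksum.equals(checksums.get(deliverableId));
        }

        // Record the checksum after a successful build of the deliverable.
        void record(String deliverableId, String currentChecksum) {
            checksums.put(deliverableId, currentChecksum);
        }
    }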

Regarding the timestamp check - note that it might be useful to disable this in Git-based flows. Git does not store or restore timestamps, so if I temporarily change branches, or make and then discard changes, timestamps can differ even when contents do not.
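For instance, a hash-only variant of the isChanged sketch above (the flag name is made up) would avoid those false positives:

    // Variant of the earlier isChanged sketch: when trustTimestamps is false
    // (e.g. in Git-based flows), skip the timestamp comparison and let the
    // content hash alone decide whether the file changed.
    static boolean isChanged(Path source, CacheEntry cached, boolean trustTimestamps)
            throws IOException {
        if (trustTimestamps
                && Files.getLastModifiedTime(source).toMillis() != cached.lastModified) {
            return true;
        }
        return !FileHasher.sha256(source).equals(cached.contentHash);
    }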
