# How to ensure the integrity of intermediate files during the re-execution of workflows

* **Difficulty level**: intermediate
* **Time needed to learn**: 10 minutes or less
* **Key points**:
  * Signatures are useful in checking the integrity of intermediate files
  * Option `-T` forces the rerun of upstream steps to ensure integrity of intermediate files
  * `traced()` targets are always checked or re-generated
  

## A problem with makefile-style workflows

<div class="bs-callout bs-callout-info" role="alert">
    <h4>Dependency check of workflow steps</h4>
    <p>A workflow step is ready to execute if all input and dependent targets exist.</p>
</div>

Suppose you are working on a project that involves executing workflows repeatedly with different input files or parameters. In that situation it is easy to lose track of intermediate files and obtain incorrect results.

Let us assume that you created the following workflow, in which step `analyze` accepts a parameter and writes its result to a file, and step `summarize` creates a report named `out.md` from the output of step `analyze`. A little trick here is that the action `report` accepts one or more input files through its `input` parameter and appends their contents to the end of the report.

In [1]:
# remove out.txt if it already exists
!rm -f out.txt
# write a workflow
%save test.sos -f

[analyze]
parameter: par=5
output: 'out.txt'

print(f'Analyze data with parameter {par}')
with open(_output, 'w') as out:
    out.write(f'Result from parameter {par}')

[summarize]
input: 'out.txt'    
print(f'Writing report with input out.txt')
report: input='out.txt', output='out.md'
    ## A summary report

Now, let us perform the analysis and generate a report:

In [2]:
%preview out.md
%runfile test.sos summarize --par 10 -v0

Analyze data with parameter 10
Writing report with input out.txt


Everything looks ok, so you would like to re-run the analysis using another parameter:

In [3]:
%preview out.md
%runfile test.sos summarize --par 20 -v0

Writing report with input out.txt


Do you see what the problem is here? When you run the step `summarize` and its input `out.txt` already exists, SoS simply executes `summarize` without executing `analyze` again, so the report is still built from the result of `--par 10`.
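
The stale-file problem can be reproduced outside SoS with a few lines of plain Python. The sketch below is only an illustration of an existence-based check, not SoS code: the functions `needs_run`, `analyze`, and `summarize` are hypothetical stand-ins for the steps above, and the files are written to a temporary directory.

```python
import os
import tempfile


def needs_run(output_path):
    """Makefile-style rule: a step runs only if its output is missing."""
    return not os.path.exists(output_path)


def analyze(par, output_path, log):
    """Hypothetical stand-in for the `analyze` step."""
    if needs_run(output_path):
        log.append(f'Analyze data with parameter {par}')
        with open(output_path, 'w') as out:
            out.write(f'Result from parameter {par}')


def summarize(output_path):
    """Hypothetical stand-in for `summarize`: return what the report would read."""
    with open(output_path) as src:
        return src.read()


with tempfile.TemporaryDirectory() as workdir:
    out_txt = os.path.join(workdir, 'out.txt')
    log = []
    analyze(10, out_txt, log)
    first = summarize(out_txt)
    analyze(20, out_txt, log)  # skipped: out.txt already exists
    second = summarize(out_txt)

print(first)   # Result from parameter 10
print(second)  # Result from parameter 10 (stale)
print(log)     # ['Analyze data with parameter 10']
```

The second call never reaches the analysis code, so the second summary silently reuses the result computed with the first parameter.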

## A clean up step

A common solution to this problem is to introduce a special step that cleans up intermediate files. In a GNU Make system this involves introducing a `clean` target and using it to remove intermediate files with commands like

```
make target par=10
make clean
make target par=20
```

We can do something similar and create a workflow as follows:

In [4]:
%save test1.sos -f

[analyze]
parameter: par=5
output: 'out.txt'

print(f'Analyze data with parameter {par}')
with open(_output, 'w') as out:
    out.write(f'Result from parameter {par}')

[summarize]
input: 'out.txt'    
print(f'Writing report with input out.txt')
report: input='out.txt', output='out.md'
    ## A summary report

[clean]
sh:
   echo Cleaning up
   rm -f out.txt out.md

After the `clean` step has removed `out.txt`, the next `summarize` run executes `analyze` again and works as expected.

In [5]:
%preview out.md
%runfile test1.sos clean -v0
%runfile test1.sos summarize --par 20 -v0

Analyze data with parameter 20
Writing report with input out.txt


You can even execute the `clean` step before `summarize` using a compound workflow as follows:

In [6]:
%preview out.md
%runfile test1.sos clean+summarize --par 25 -v0

Analyze data with parameter 25
Writing report with input out.txt


## Dependency tracing with option `-T` 

An easier method, perhaps unique to SoS, is to force the workflow engine to trace the dependencies of existing files. More specifically, with option `-T` (trace dependency), SoS checks whether the input and dependent targets are the result of another step, and reruns those steps even if the targets already exist.

In [7]:
%preview out.md
%runfile test.sos summarize --par 26 -T -v0

Analyze data with parameter 26
Writing report with input out.txt


In addition to avoiding the trouble of a `clean` step, this method is also more performant, because the `analyze` step will be ignored if its signature matches. This can be shown by running the same workflow twice as follows:

In [8]:
%runfile test.sos summarize --par 26 -T 

Writing report with input out.txt


Because `out.txt` was generated with `par=26`, rerunning the workflow does not regenerate `out.txt`. This is clearly better than the `clean` approach, which always forces the re-execution of the `analyze` step.
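
The idea of skipping a step whose signature matches can be sketched in plain Python. This is a deliberately simplified illustration and not how SoS stores or computes its signatures: here a "signature" is assumed to be just the parameter value plus a SHA-256 hash of the output file, saved in a hypothetical `out.txt.sig` file, and `run_analyze` is a made-up helper.

```python
import hashlib
import json
import os
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's content."""
    with open(path, 'rb') as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def run_analyze(par, output_path, sig_path, log):
    """Run the step unless the saved signature matches the parameter and output."""
    if os.path.exists(sig_path) and os.path.exists(output_path):
        with open(sig_path) as fh:
            saved = json.load(fh)
        current = {'par': par, 'output_sha256': file_sha256(output_path)}
        if saved == current:
            log.append(f'skip par={par}')
            return
    # No signature, or a mismatch: regenerate the output and save a new signature.
    with open(output_path, 'w') as out:
        out.write(f'Result from parameter {par}')
    with open(sig_path, 'w') as fh:
        json.dump({'par': par, 'output_sha256': file_sha256(output_path)}, fh)
    log.append(f'run par={par}')


with tempfile.TemporaryDirectory() as workdir:
    out_txt = os.path.join(workdir, 'out.txt')
    sig_file = os.path.join(workdir, 'out.txt.sig')
    log = []
    run_analyze(26, out_txt, sig_file, log)  # no signature yet: runs
    run_analyze(26, out_txt, sig_file, log)  # signature matches: skipped
    run_analyze(27, out_txt, sig_file, log)  # parameter changed: runs again
    with open(out_txt) as fh:
        final_content = fh.read()

print(log)            # ['run par=26', 'skip par=26', 'run par=27']
print(final_content)  # Result from parameter 27
```

Unlike a plain existence check, the comparison notices both a changed parameter and a modified output file, which is why a rerun with unchanged settings costs almost nothing.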

## Forcing dependency tracing of selected files

<div class="bs-callout bs-callout-primary" role="alert">
    <h4><code>traced</code> target</h4>
    <p>A <code>traced</code> target (e.g. <code>traced('out.txt')</code>) will always be verified or re-generated even if it already exists. The function converts its parameters to a <code>sos_targets</code> object and marks all its targets. Therefore, this function accepts all parameters of <code>sos_targets</code> and you can use it in forms such as</p>
    <pre>
    traced('a.txt')
    traced('a.txt', 'b.txt')
    traced(A='a.txt', B='b.txt')
    ...
    </pre>
</div>

The `-T` option is convenient but can be slow if your workflow handles a large number of files, because SoS needs to determine the dependent steps of all input and dependent files. In addition, users of your workflow may still produce erroneous output if they do not know about the `-T` option.

If some intermediate files are important and you always want to make sure that they are up to date, you can mark them as `traced` by wrapping them in the `traced` function.

In [9]:
%save test_traced.sos -f

[analyze]
parameter: par=5
output: 'out.txt'

print(f'Analyze data with parameter {par}')
with open(_output, 'w') as out:
    out.write(f'Result from parameter {par}')

[summarize]
input: traced('out.txt')
print(f'Writing report with input out.txt')
report: input='out.txt', output='out.md'
    ## A summary report

[clean]
sh:
   echo Cleaning up
   rm -f out.txt out.md

Now, when you execute this workflow, the `analyze` step will always be executed (or ignored if signature matches) to ensure the integrity of `out.txt`.

In [10]:
%preview out.md
%runfile test_traced.sos summarize --par 3 -v0

Analyze data with parameter 3
Writing report with input out.txt


In [11]:
%preview out.md
%runfile test_traced.sos summarize --par 4 -v0

Analyze data with parameter 4
Writing report with input out.txt


In [12]:
%preview out.md
%runfile test_traced.sos summarize --par 4 -v0

Writing report with input out.txt


Note that the `analyze` step is ignored in the last case because its signature matches, so you do not lose any productivity by tracing `out.txt`.
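
The difference between untraced targets, `traced` targets, and option `-T` can be summarized with a small plain-Python model. It is only a sketch of the selection rule described in this tutorial, not SoS's implementation: the `Target` class, the `steps_to_check` function, the `producers` mapping, and the file `raw.csv` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical target record; `traced` mimics wrapping a file in traced()."""
    name: str
    traced: bool = False


def steps_to_check(inputs, existing, producers, trace_all=False):
    """Return the upstream steps that must be checked before a step runs.

    inputs:    Target objects required by the step
    existing:  set of file names that already exist
    producers: mapping of file name -> name of the step that creates it
    trace_all: mimics the effect of option -T on every target
    """
    selected = []
    for target in inputs:
        missing = target.name not in existing
        if (missing or target.traced or trace_all) and target.name in producers:
            selected.append(producers[target.name])
    return selected


producers = {'out.txt': 'analyze'}   # raw.csv is not produced by any step
existing = {'out.txt', 'raw.csv'}

plain = steps_to_check([Target('out.txt'), Target('raw.csv')], existing, producers)
marked = steps_to_check([Target('out.txt', traced=True), Target('raw.csv')], existing, producers)
forced = steps_to_check([Target('out.txt'), Target('raw.csv')], existing, producers, trace_all=True)

print(plain)   # []
print(marked)  # ['analyze']
print(forced)  # ['analyze']
```

In this model, `trace_all` has to look up a producer for every input, whereas a `traced` flag limits the lookup to the files the workflow author marked as important.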

## Further reading

* [How to create dependencies between SoS steps](doc/user_guide/step_dependencies.html)