# How to define a good wine?

Our next example is about wine. We'll use the [Wine Quality dataset](https://archive.ics.uci.edu/dataset/186/wine+quality) from the UC Irvine Machine Learning Repository[<sup>1</sup>](#fn1). It consists of 4898 samples and 11 features, and it can be used for both regression and classification. You can find the dataset in my GitHub repository as well. We'll use it to figure out what aspects of wine make it "good", and to explore some features of Snakemake, such as:

- **checkpoints**: to handle dynamically generated files
- **wildcards**: to handle multiple types of regression models

These will help make our workflow more robust and flexible at the same time. Our workflow consists of 5 steps:
1. Preprocess: to clean, scale and split the dataset into train and test sets
2. Dynamic model selection: to dynamically determine which model to train based on the dataset size
3. Train: to train regression model(s) (e.g. Linear Regression, Ridge and Lasso, in this example)
4. Evaluate: to evaluate the model performance using mean squared error (MSE)
5. Visualize: to generate plots comparing model performance
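
Since step 4 relies on the mean squared error, here is a minimal sketch of how that metric is computed. It is a plain-Python illustration, not the code in `scripts/evaluate.py`, and the quality scores below are made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical quality scores, not taken from the dataset
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 5.5]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # 0.375
```

A lower MSE means the predicted quality scores are closer to the true ones, which is how we'll compare the models later.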

### 1. Checkpoints
- Checkpoints allow Snakemake to determine the inputs of a rule from outputs generated at runtime. This is useful when the exact inputs of downstream rules are unknown until an upstream rule has executed. In our example, we'll use a checkpoint to select which regression models to train based on the size of the preprocessed training set: `Ridge` and `Lasso` if it has more than 1000 training samples, and `LinearRegression` otherwise.
- The checkpoint writes the selected models to a file (`output/selected_models.txt` in this case), which the `evaluate` rule then reads in its input function to work out which model files it needs.
- The `scripts/select_models.py` script looks like this:
    ```python
    import pandas as pd

    # Load preprocessed training data
    X_train = pd.read_csv(snakemake.input[0])

    # Dynamically decide which models to train
    selected_models = []
    if X_train.shape[0] > 1000:  # Example condition: large dataset
        selected_models.extend(["Ridge", "Lasso"])
    else:  # Small dataset
        selected_models.append("LinearRegression")

    # Save selected models to file
    with open(snakemake.output[0], "w") as f:
        for model in selected_models:
            f.write(model + "\n")
    ```

- The corresponding checkpoint, `select_models`, in our Snakefile is:
    ```python
    checkpoint select_models:
        input:
            "output/X_train.csv"
        output:
            "output/selected_models.txt"
        script:
            "scripts/select_models.py"
    ```
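
To make the selection rule easier to reason about, the same condition can be sketched as a small standalone function. The function name `choose_models` and the `threshold` parameter are assumptions introduced for this illustration; the actual script works on the `snakemake` object as shown above.

```python
def choose_models(n_train_samples, threshold=1000):
    """Return the model names to train for a training set of the given size."""
    # Same condition as in select_models.py: strictly more than the threshold
    if n_train_samples > threshold:
        return ["Ridge", "Lasso"]
    return ["LinearRegression"]


# Hypothetical training set sizes
large = choose_models(1500)     # ["Ridge", "Lasso"]
boundary = choose_models(1000)  # ["LinearRegression"]
small = choose_models(200)      # ["LinearRegression"]
```

Note that a training set of exactly 1000 samples falls into the "small" branch, because the condition is a strict `>`.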

### 2. Wildcards

- In the `train` rule, we use the `{model}` wildcard to determine which model to train and where to save its coefficients. The wildcard is set indirectly by the `evaluate` rule, so we need to look at that rule first. Below is what our `evaluate` rule looks like.
    ```python
    rule evaluate:
        input:
            models=lambda wildcards: expand(
                "output/models/{model}.csv",
                model=open(checkpoints.select_models.get().output[0]).read().strip().split()
            ),
            X_test="output/X_test.csv",
            y_test="output/y_test.csv"
        output:
            "output/model_results.csv"
        script:
            "scripts/evaluate.py"
    ```

- The `evaluate` rule takes 3 inputs, the first being the list of model coefficient files. To build that list, the input function retrieves the output of the `select_models` checkpoint with `get()`, reads the model names from it, and passes them to `expand()`, which fills each name into the `{model}` placeholder of the path `output/models/{model}.csv`.
- To run the `evaluate` rule, Snakemake needs all 3 inputs: the model coefficient files, the test features and the test labels. The test features and labels are produced by the `preprocess` rule. The model coefficient files match the output file name pattern we defined in the `train` rule, so Snakemake runs `train` once per requested file, with `{model}` set to the name found in that file's path.
- When building the DAG of our workflow, Snakemake therefore places the `train` rule before the `evaluate` rule, and `train` only receives concrete values for `{model}` through the files that `evaluate` requests. That's why running the `train` rule directly fails.
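
The input function of `evaluate` can be pictured with a plain-Python sketch. This is not Snakemake code: `model_paths` is a hypothetical helper that mimics what reading the checkpoint output and calling `expand()` achieve together, and a temporary file stands in for `output/selected_models.txt`.

```python
import tempfile
from pathlib import Path


def model_paths(selected_models_file, pattern="output/models/{model}.csv"):
    """Read whitespace-separated model names and fill each into the path pattern."""
    names = Path(selected_models_file).read_text().split()
    return [pattern.format(model=name) for name in names]


# Stand-in for output/selected_models.txt as written by the checkpoint
with tempfile.TemporaryDirectory() as tmp:
    selected = Path(tmp) / "selected_models.txt"
    selected.write_text("Ridge\nLasso\n")
    paths = model_paths(selected)

print(paths)  # ['output/models/Ridge.csv', 'output/models/Lasso.csv']
```

Each path in the resulting list is a file that `evaluate` asks for, and each one matches the output pattern of `train` with a different value of `{model}`.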

In [9]:
!snakemake --cores 1 train

Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).


- We cannot use a rule with wildcards as the final target, because Snakemake cannot figure out which files to produce. Instead, we can do one of the following:
1. Write an `all` rule at the top of our `Snakefile` that lists the final output as its input. Snakemake runs the first rule in the `Snakefile` by default if no target is specified at the CLI; by convention, this rule is named `all`.
    ```python
    rule all:
        input:
            "output/visualizations/model_performance.png"
    ```

In [10]:
!snakemake --cores 1

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job              count
-------------  -------
all                  1
evaluate             1
preprocess           1
select_models        1
visualize            1
total                5

Select jobs to execute...

[Fri Jan 17 12:30:24 2025]
rule preprocess:
    input: data/winequality-red.csv
    output: output/X_train.csv, output/X_test.csv, output/y_train.csv, output/y_test.csv
    jobid: 4
    reason: Missing output files: output/y_test.csv, output/X_train.csv, output/X_test.csv
    resources: tmpdir=/tmp

[Fri Jan 17 12:30:26 2025]
Finished job 4.
1 of 5 steps (20%) done
Select jobs to execute...

[Fri Jan 17 12:30:26 2025]
checkpoint select_models:
    input: output/X_train.

2. Or we can explicitly specify the output file we want to produce.

In [11]:
!snakemake --cores 1 clean_output
!snakemake --cores 1 "output/visualizations/model_performance.png"

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job             count
------------  -------
clean_output        1
total               1

Select jobs to execute...

[Fri Jan 17 12:30:38 2025]
rule clean_output:
    jobid: 0
    reason: Rules with neither input nor output files are always executed.
    resources: tmpdir=/tmp

[Fri Jan 17 12:30:38 2025]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2025-01-17T123037.397202.snakemake.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job              count
-------------  -------
evaluate             1
preprocess

3. Or we can target a downstream rule that has no wildcards, such as `evaluate` or `visualize`.

Let's end this tutorial with the rule graph of our workflow.

In [12]:
!snakemake --rulegraph --cores 1 | dot -Tsvg > rulegraph.svg

Building DAG of jobs...


![](rulegraph.svg)

## References
<span id="fn1">1. </span> Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. 2009. Wine Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T.