# How to define a good wine?

Our next example is about wine. We'll use the [Wine Quality dataset](https://archive.ics.uci.edu/dataset/186/wine+quality) from the UC Irvine Machine Learning Repository[<sup>1</sup>](#fn1). It consists of 4898 samples and 11 features, and it can be used for both regression and classification. You can find the dataset in my GitHub repository as well. We'll use it to figure out what aspects of a wine make it "good", and to explore some features of Snakemake, such as:

- **checkpoints**: to handle dynamically generated files
- **wildcards**: to handle multiple types of regression models

These will help make our workflow more robust and flexible at the same time. Our workflow consists of 5 steps:
1. Preprocess: to clean, scale and split the dataset into train and test sets
2. Dynamic model selection: to dynamically determine which model to train based on the dataset size
3. Train: to train regression model(s) (e.g. Linear Regression, Ridge and Lasso, in this example)
4. Evaluate: to evaluate the model performance using mean squared error (MSE)
5. Visualize: to generate plots comparing model performance
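
Step 4 scores each model by its mean squared error, which is the average of the squared differences between the true and the predicted quality scores. As a minimal pure-Python sketch of that metric (the helper name `mean_squared_error` and the sample values are illustrative assumptions, not taken from the workflow's scripts):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative quality scores, made up for this sketch.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 5.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)
```

A lower value means the predictions are closer to the true scores. scikit-learn's `sklearn.metrics.mean_squared_error` computes the same quantity.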

### 1. Checkpoints
- Checkpoints allow Snakemake to determine the inputs of a rule from outputs generated at runtime. This is useful when the exact inputs of downstream rules are unknown until an upstream rule has executed. In our example, we'll use a checkpoint to select which regression models to train based on the size of the preprocessed training dataset: if it has more than 1000 training samples, we'll train `Ridge` and `Lasso`; otherwise we'll go with `LinearRegression`.
- The checkpoint writes the selected models to a file (`output/selected_models.txt` in this case), which is then used by the `train` rule.
- The script `scripts/select_models.py` looks like this:
    ```python
    import pandas as pd

    # Load preprocessed training data
    X_train = pd.read_csv(snakemake.input[0])

    # Dynamically decide which models to train
    selected_models = []
    if X_train.shape[0] > 1000:  # Example condition: large dataset
        selected_models.extend(["Ridge", "Lasso"])
    else:  # Small dataset
        selected_models.append("LinearRegression")

    # Save selected models to file
    with open(snakemake.output[0], "w") as f:
        for model in selected_models:
            f.write(model + "\n")
    ```
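
The selection logic and the file format can also be tried outside Snakemake. The sketch below is a hypothetical standalone version: the function names `select_models`, `write_selected_models` and `read_selected_models` and the `threshold` parameter are assumptions made for illustration, and a temporary file stands in for `output/selected_models.txt`.

```python
import os
import tempfile


def select_models(n_train_samples, threshold=1000):
    """Mirror the condition above: Ridge and Lasso only for larger datasets."""
    if n_train_samples > threshold:
        return ["Ridge", "Lasso"]
    return ["LinearRegression"]


def write_selected_models(models, path):
    """Write one model name per line, as the checkpoint script does."""
    with open(path, "w") as f:
        for model in models:
            f.write(model + "\n")


def read_selected_models(path):
    """Read the model names back as a list of whitespace-separated tokens."""
    with open(path) as f:
        return f.read().split()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "selected_models.txt")
    write_selected_models(select_models(3000), path)
    large = read_selected_models(path)
    write_selected_models(select_models(500), path)
    small = read_selected_models(path)

print(large, small)
```

Because the comparison is strict, a training set with exactly 1000 samples still selects `LinearRegression`.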

### 2. Wildcards

- In the `train` rule, the `{model}` wildcard in the output path determines which model a job trains. A `lambda` function in the `Snakefile` reads the list of selected models from the checkpoint's output (`output/selected_models.txt`) and passes it to the script as a parameter.
    ```python
    rule train:
        input:
            X_train="output/X_train.csv",
            y_train="output/y_train.csv"
        output:
            "output/models/{model}.csv"
        params:
            model_list=lambda wildcards, input: open("output/selected_models.txt").read().strip().split()
        script:
            "scripts/train.py"
    ```

- Our script, `scripts/train.py`, reads the wildcard value as follows:
    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    import os

    # Load preprocessed data
    X_train = pd.read_csv(snakemake.input[0])
    y_train = pd.read_csv(snakemake.input[1]).values.ravel()

    # Determine model type from wildcard
    model_type = snakemake.wildcards.model

    if model_type == "LinearRegression":
        model = LinearRegression()
    elif model_type == "Ridge":
        model = Ridge(alpha=1.0)
    elif model_type == "Lasso":
        model = Lasso(alpha=0.1)
    else:
        raise ValueError(f"Unknown model type: {model_type}")

    # Train the model
    model.fit(X_train, y_train)

    # Save the model coefficients
    coefficients = pd.DataFrame(model.coef_, columns=["Coefficient"])
    coefficients.to_csv(snakemake.output[0], index=False)
    ```

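- To avoid redundancy in the `evaluate` rule, we can resolve the `{model}` wildcard **dynamically**. The list of selected models already lives in `output/selected_models.txt`, so we don't need to redefine it. An input function reads that file once the checkpoint has run and expands the wildcard into one file per model, so the `evaluate` rule adapts automatically. The code below assumes the checkpoint rule is named `select_models`.
    ```python
    def selected_model_files(wildcards):
        # Assumption: the checkpoint rule is named `select_models`.
        # .get() makes Snakemake wait for the checkpoint and re-evaluate the DAG.
        with checkpoints.select_models.get().output[0].open() as f:
            models = f.read().split()
        return expand("output/models/{model}.csv", model=models)

    rule evaluate:
        input:
            models=selected_model_files,
            X_test="output/X_test.csv",
            y_test="output/y_test.csv"
        output:
            "output/model_results.csv"
        script:
            "scripts/evaluate.py"
    ```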
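
To see what the `{model}` wildcard does, it can help to mimic it in plain Python. Snakemake matches a requested file against a rule's output pattern to find the wildcard value, and `expand` goes the other way, from a list of values to file names. The sketch below imitates both with a regular expression and a list comprehension. The helper names `match_model` and `expand_models` are hypothetical, and this is an analogy rather than Snakemake's actual implementation.

```python
import re

# Rough regex equivalent of the output pattern "output/models/{model}.csv";
# by default a Snakemake wildcard matches the regular expression ".+".
MODEL_PATTERN = re.compile(r"output/models/(?P<model>.+)\.csv")


def match_model(path):
    """Return the value of the model wildcard for a path, or None if no match."""
    m = MODEL_PATTERN.fullmatch(path)
    return m.group("model") if m else None


def expand_models(models):
    """Build one output path per model, like expand() with a single wildcard."""
    return [f"output/models/{model}.csv" for model in models]


targets = expand_models(["Ridge", "Lasso"])
print(targets)
print([match_model(t) for t in targets])
```

Requesting `output/models/Lasso.csv` therefore starts a `train` job with `model=Lasso`, which is what the log of the run shows.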


Let's run our workflow to train a model.

In [1]:
!snakemake --cores 1

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job          count
---------  -------
all              1
evaluate         1
train            2
visualize        1
total            5

Select jobs to execute...

[Fri Jan 10 11:16:16 2025]
rule train:
    input: output/X_train.csv, output/y_train.csv
    output: output/models/Lasso.csv
    jobid: 7
    reason: Code has changed since last execution
    wildcards: model=Lasso
    resources: tmpdir=/tmp

[Fri Jan 10 11:16:18 2025]
Finished job 7.
1 of 5 steps (20%) done
Select jobs to execute...

[Fri Jan 10 11:16:18 2025]
rule train:
    input: output/X_train.csv, output/y_train.csv
    output: output/models/Ridge.csv
    jobid: 6
    reason: Code has changed since last execution


In [1]:
!snakemake --rulegraph --cores 1 | dot -Tsvg > rulegraph.svg

Building DAG of jobs...


## References
<span id="fn1">1. </span> Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T