## Adding metrics and plots to dvc.yaml

In this exercise, your task is to complete the dvc.yaml file that defines a model training workflow.

Here, preprocess_dataset.py and train.py perform data preprocessing and model training, starting from weather.csv in the raw_dataset folder. The training code produces two outputs: predictions.csv, which contains the predictions alongside the ground truth, and metrics.json, which contains structured metrics data. The predictions file is used to generate a normalized confusion matrix plot that can be compared with previous commits.

### IDE Exercise Instructions
    - Set the metrics target to the output metrics file.
    - Set the plot target to the output file containing predictions data.
    - Set the plot template to confusion_normalized to plot the normalized confusion matrix.
    - Set the correct value for the cache key so that plots are tracked in the Git repository instead of the DVC remote.

In [None]:
dvc.yaml
stages:
  preprocess:
    cmd: python3 preprocess_dataset.py
    deps:
    - preprocess_dataset.py
    - raw_dataset/weather.csv
    - utils_and_constants.py
    outs:
    - processed_dataset/weather.csv
  train:
    cmd: python3 train.py
    deps:
    - metrics_and_plots.py
    - model.py
    - processed_dataset/weather.csv
    - train.py
    - utils_and_constants.py
    metrics:
      # Specify the metrics file as target
      - metrics.json:
          cache: false
    plots:
      # Set the target to the file containing predictions data
      - predictions.csv:
          # Write the plot template
          template: confusion_normalized
          x: predicted_label
          y: true_label
          x_label: 'Predicted label'
          y_label: 'True label'
          title: Confusion matrix
          # Set the cache parameter to store
          # plot data in git repository
          cache: false
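
To make the link between the training script and the stage definition above concrete, here is a minimal Python sketch of how a script could write the two files that the train stage declares. The helper names `write_outputs` and `normalized_confusion`, the example labels, and the per-true-class normalization are assumptions for illustration; the course's train.py and metrics_and_plots.py are not shown here and may differ.

```python
import csv
import json
from collections import Counter


def write_outputs(y_true, y_pred, metrics, predictions_path, metrics_path):
    """Write the predictions and metrics files declared by the train stage.

    Hypothetical helper: the real training code is not shown in the exercise.
    """
    with open(predictions_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Column names must match the x and y fields of the plot definition.
        writer.writerow(["predicted_label", "true_label"])
        writer.writerows(zip(y_pred, y_true))
    with open(metrics_path, "w") as f:
        json.dump(metrics, f)


def normalized_confusion(y_true, y_pred):
    """Return {(true, predicted): share of that true class}.

    Assumes normalization per true class, so each true class sums to 1.
    """
    pair_counts = Counter(zip(y_true, y_pred))
    true_counts = Counter(y_true)
    return {pair: n / true_counts[pair[0]] for pair, n in pair_counts.items()}
```

Because both files are small text files, setting `cache: false` keeps them in Git, which is what allows them to be compared across commits.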



## Comparing metrics across Git branches

In this exercise, you will use DVC to query and compare metrics across branches. This helps you decide whether a change actually improves a machine learning model.

You will start in the main branch, where a DVC pipeline has already been executed and its results committed to Git. First, query the metrics in the main branch. Then switch to a new training branch, change a hyperparameter, rerun the pipeline, and compare the resulting metrics with the main branch.

### IDE Exercise Instructions
    - Query the metrics in the main branch by running the dvc metrics show command in the terminal.
    - Check out a new branch named train.
    - Change RFC_FOREST_DEPTH to 4 in the opened utils_and_constants.py file, and execute the DVC pipeline.
    - Compare the changed metrics with the main branch using the dvc metrics diff main --md | tee metrics_diff.md command.

In [None]:
utils_and_constants.py
import shutil
from pathlib import Path

DATASET_TYPES = ["test", "train"]
DROP_COLNAMES = ["Date"]
TARGET_COLUMN = "RainTomorrow"
RAW_DATASET = "raw_dataset/weather.csv"
PROCESSED_DATASET = "processed_dataset/weather.csv"
RFC_FOREST_DEPTH = 4


def delete_and_recreate_dir(path):
    # Remove the directory if it exists (ignoring removal errors),
    # then recreate it empty.
    shutil.rmtree(path, ignore_errors=True)
    Path(path).mkdir(parents=True, exist_ok=True)

#$ dvc metrics show
#$ git checkout -b train
#$ dvc repro
#$ dvc metrics diff main --md | tee metrics_diff.md
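
The comparison step can be pictured with a small Python sketch that diffs two flat metrics dictionaries and renders a Markdown table. This is only an illustration of the idea: `metrics_diff_markdown` is a hypothetical helper, the column layout is an assumption rather than DVC's exact output, and the sample values in the usage note are made up.

```python
def metrics_diff_markdown(old, new, old_name="main", new_name="workspace"):
    """Render a Markdown table of changes between two flat metric dicts."""
    lines = [
        f"| Metric | {old_name} | {new_name} | Change |",
        "|---|---|---|---|",
    ]
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        # A metric present on only one side has no numeric change.
        change = "-" if a is None or b is None else f"{b - a:+.4f}"
        lines.append(f"| {key} | {a} | {b} | {change} |")
    return "\n".join(lines)
```

For example, `metrics_diff_markdown({"accuracy": 0.9}, {"accuracy": 0.95})` yields a table whose accuracy row shows a change of `+0.0500`.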

## Run DVC pipeline in GitHub Actions

In this exercise, you will use the CML GitHub Action to run a DVC pipeline and compare metrics between the training branch and main. The pipeline is triggered when you open a PR against the main branch.

Running train.py produces a metrics.json file; its model metrics are the source data for comparing branches.

Your task is to complete the scaffolded .github/workflows/dvc_cml.yaml so that it defines a high-level model training flow. Scroll down to line 24 to make changes.

### IDE Exercise Instructions
    - Set up the DVC GitHub Action iterative/setup-dvc@v1.
    - Run the DVC pipeline in the Run DVC pipeline step.
    - Compare metrics with the main branch and write the Markdown report in the Write CML report step.
    - Write the correct file in the cml comment create command to create a comment in the PR.

In [None]:
dvc_cml.yaml
name: dvc-pipeline

on:
  pull_request:
    branches: main

permissions: write-all

jobs:
  train_and_report_eval_performance:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup CML
        uses: iterative/setup-cml@v1
          
      # Setup DVC GitHub Action
      - name: Setup DVC
        uses: iterative/setup-dvc@v1
          
      # Run DVC pipeline
      - name: Run DVC pipeline
        run: dvc repro

      - name: Write CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Compare metrics with main branch
          git fetch --prune
          dvc metrics diff --md main >> metrics_compare.md
          
          # Create comment from markdown report
          cml comment create metrics_compare.md

## Adding Hyperparameter tuning to dvc.yaml

In this exercise, your task is to define a hyperparameter tuning workflow. The Python file hp_tuning.py is the hyperparameter tuning script; it takes the hyperparameter configuration file hp_config.json as input and produces rfc_best_params.json as output.

The dvc.yaml file outlines the DVC workflow orchestrating the hyperparameter tuning and lists commands, dependencies, and outputs.

NOTE: This exercise involves changing both hp_tuning.py and dvc.yaml. Both files have been opened for you in the editor.

### IDE Exercise Instructions
    - Set the hyperparameter tuning command that runs the Python script hp_tuning.py in dvc.yaml.
    - Specify the hyperparameter configuration hp_config.json in both dvc.yaml and hp_tuning.py.
    - Specify the hyperparameter tuning script hp_tuning.py in dvc.yaml as a dependency.
    - Perform Grid Search Cross Validation on the training data in hp_tuning.py.

In [None]:
dvc.yaml
stages:
  preprocess:
    cmd: python3 preprocess_dataset.py
    deps:
    - preprocess_dataset.py
    - raw_dataset/weather.csv
    - utils_and_constants.py
    outs:
    - processed_dataset/weather.csv
  hp_tune:
    # Set the hyperparameter tuning command
    cmd: python3 hp_tuning.py
    deps:
    - processed_dataset/weather.csv
    # Specify the hyperparameter configuration as dependency
    - hp_config.json
    # Specify the hyperparameter script as dependency
    - hp_tuning.py
    - utils_and_constants.py
    outs:
      - hp_tuning_results.md:
          cache: false

hp_tuning.py
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from utils_and_constants import PROCESSED_DATASET, get_hp_tuning_results, load_data


def main():
    X, y = load_data(PROCESSED_DATASET)
    X_train, _, y_train, _ = train_test_split(X, y, random_state=1993)

    model = RandomForestClassifier()
    # Read the config file to define the hyperparameter search space
    with open("hp_config.json", "r") as config_file:
        param_grid = json.load(config_file)

    # Perform Grid Search Cross Validation on training data
    grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=1, verbose=2)
    grid_search.fit(X_train, y_train)

    best_params = grid_search.best_params_

    print("====================Best Hyperparameters==================")
    print(json.dumps(best_params, indent=2))
    print("==========================================================")

    with open("rfc_best_params.json", "w") as outfile:
        json.dump(best_params, outfile)

    markdown_table = get_hp_tuning_results(grid_search)
    with open("hp_tuning_results.md", "w") as markdown_file:
        markdown_file.write(markdown_table)


if __name__ == "__main__":
    main()
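
To show how a JSON search space turns into the candidates that a grid search evaluates, here is a minimal standard-library sketch. The contents of `EXAMPLE_CONFIG` are hypothetical, since the real hp_config.json is not shown in this exercise, and `expand_grid` is an illustrative helper rather than part of scikit-learn.

```python
import itertools
import json

# Hypothetical contents of hp_config.json: parameter name -> list of values.
EXAMPLE_CONFIG = '{"n_estimators": [10, 50], "max_depth": [2, 4, 8]}'


def expand_grid(param_grid):
    """Return every combination of parameter values as a list of dicts."""
    keys = sorted(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


param_grid = json.loads(EXAMPLE_CONFIG)
candidates = expand_grid(param_grid)

# With cv=5, each candidate is fitted once per fold.
n_fits = len(candidates) * 5
```

The number of fits grows multiplicatively with each added parameter, which is why the search space is kept in a small, versioned config file that DVC tracks as a dependency.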


## Running Hyperparameter tuning DVC pipelines

In this exercise, you will run the hyperparameter tuning and model training targets defined in the dvc.yaml pipeline. The dvc.yaml file outlines the DVC workflow orchestrating the jobs and lists commands, dependencies, and outputs.

You will experiment with running these pipelines independently and observe the interaction between the two via the best parameter configuration file rfc_best_params.json.

In this design, the file can either be edited manually before a training job or be written as the output of a hyperparameter tuning job.

NOTE: You will start working on the main branch. Git and DVC are already initialized for you.

### IDE Exercise Instructions
    - Run the DVC training pipeline (target name train) and observe changes in metrics.json (values should be populated).
    - Commit changes with git add . && git commit -m "train on main".
    - Check out a new branch with git checkout -b hp_tune_and_train and force-run the DVC hyperparameter tuning pipeline (target name hp_tune); the results file hp_tuning_results.md should now appear in the file browser.
    - Run the DVC training pipeline again, and compare metrics with the main branch using the dvc metrics diff command.

In [None]:
metrics.json
{"accuracy": 0.9998, "precision": 1.0, "recall": 0.9993, "f1_score": 0.9996}

#$ dvc repro train
#$ git add . && git commit -m "train on main"
#$ git checkout -b hp_tune_and_train
#$ dvc repro -f hp_tune
#$ dvc repro train
#$ dvc metrics diff
#$ dvc metrics diff main
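
The interaction between the two stages through rfc_best_params.json can be sketched as follows. This is an assumption about how a training script might consume the file, not the course's actual train.py: `load_best_params` and `DEFAULT_PARAMS` are hypothetical names, and the default value is made up.

```python
import json
from pathlib import Path

# Hypothetical fallback used when no tuned parameters are available.
DEFAULT_PARAMS = {"max_depth": 2}


def load_best_params(path="rfc_best_params.json", defaults=None):
    """Merge tuned parameters from a JSON file over a copy of the defaults.

    The file may be hand-edited or written by a tuning job; a missing file
    simply leaves the defaults in place.
    """
    params = dict(defaults or {})
    file_path = Path(path)
    if file_path.exists():
        params.update(json.loads(file_path.read_text()))
    return params
```

Under this sketch, rerunning the tuning stage changes the file, and the next training run picks up the new values.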

## Setup Hyperparameter Tuning in GitHub Actions

Imagine a repository with the structure shown in the editor. Your task is to complete the scaffolded .github/workflows/hp_cml.yaml so that it runs hyperparameter tuning and then opens a new pull request from a new training branch to main. That pull request runs the training pipeline, which reads the best hyperparameters from rfc_best_params.json.

By convention, hyperparameter tuning branches start with hp_tune/ and training branches start with train/.

### IDE Exercise Instructions
    - Guard the hyperparameter tuning job so that it is only triggered from a head branch with the prefix hp_tune/.
    - Write the correct cml subcommand to create a pull request.
    - Write the correct prefix for the new head (model training) branch and the target branch (where code is merged) in the cml subcommand.
    - Write the JSON output file from the hyperparameter tuning job in the cml subcommand.

In [None]:
hp_cml.yaml
name: hp-tuning

on:
  pull_request:
    branches: main

permissions: write-all

jobs:
  hp_tune:
    # Only run job if the head branch
    # starts with the right prefix
    if: startsWith(github.head_ref, 'hp_tune/')
    runs-on: ubuntu-latest
    steps:
      - name: Checkout 
        uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.9

      - name: Setup DVC
        uses: iterative/setup-dvc@v1

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DVC pipeline
        run: |
          dvc repro -f hp_tune
      
      - name: Create training branch
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
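          # Finish the create pull request command.
          # Comments stay above the command: a comment line between
          # backslash continuations would cut the command short.
          #   --branch: the new head (training) branch name
          #   --target-branch: the branch the pull request merges into
          #   last argument: the hyperparameter job output file to commit
          cml pr create \
            --user-email hp-bot@cicd.ai \
            --user-name HPBot \
            --message "HP tuning" \
            --branch train/${{ github.sha }} \
            --target-branch main \
            rfc_best_params.json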
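
The branch convention used by this workflow can be expressed as two small Python functions. This is only a sketch of the logic: `should_run_hp_tune` and `training_branch` are hypothetical names, and the case-insensitive comparison mirrors the documented behavior of the `startsWith` expression function in GitHub Actions.

```python
def should_run_hp_tune(head_ref, prefix="hp_tune/"):
    """Mirror the job guard: run only for head branches with the tuning prefix."""
    # startsWith in GitHub Actions expressions ignores case.
    return head_ref.lower().startswith(prefix.lower())


def training_branch(sha, prefix="train/"):
    """Build the name of the training branch created for a given commit."""
    return f"{prefix}{sha}"
```

Deriving the training branch name from the commit SHA keeps each tuning run on its own branch, so repeated runs do not collide.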
