## Develop a classification model

In this exercise, you'll work with the weather dataset and write training code to predict next-day rainfall. The file preprocess_dataset.py contains helper functions for pre-processing the dataset. Your task is to complete the scaffolded train.py so that it implements the high-level model training flow.

Feel free to explore the Python files to see the complete implementation of the workflow.

NOTE: Use python3 instead of python to run Python scripts.

### IDE Exercise Instructions

- Split the pre-processed dataset in train.py using the train_test_split function from scikit-learn.
- Train the model on the training set by passing the correct arguments to the train_model function.
- Calculate metrics on the test set by passing the correct arguments to the evaluate_model function.
- Save the metrics dictionary to a JSON file using the save_metrics function.

```python
import json

import pandas as pd
from sklearn.model_selection import train_test_split

from metrics_and_plots import plot_confusion_matrix, save_metrics
from model import evaluate_model, train_model
from utils_and_constants import PROCESSED_DATASET, TARGET_COLUMN


def load_data(file_path):
    data = pd.read_csv(file_path)
    X = data.drop(TARGET_COLUMN, axis=1)
    y = data[TARGET_COLUMN]
    return X, y


def main():
    X, y = load_data(PROCESSED_DATASET)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1993)

    # Train the model using the training set
    model = train_model(X_train, y_train)

    # Calculate test set metrics
    metrics = evaluate_model(model, X_test, y_test)

    print("====================Test Set Metrics==================")
    print(json.dumps(metrics, indent=2))
    print("======================================================")

    # Save metrics into json file
    save_metrics(metrics)
    plot_confusion_matrix(model, X_test, y_test)


if __name__ == "__main__":
    main()
```
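
The helpers train_model and evaluate_model are imported from model.py, whose contents are not shown here. The following is a minimal sketch of what they might look like, assuming a scikit-learn random forest and a metrics dictionary with the keys accuracy, precision, recall, and f1_score. The hyperparameters, the key names, and the synthetic dataset are all assumptions made for illustration, not the course's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def train_model(X_train, y_train):
    """Hypothetical stand-in for model.train_model: fit a random forest."""
    # Hyperparameters are illustrative assumptions, not the course's values.
    model = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=1993)
    model.fit(X_train, y_train)
    return model


def evaluate_model(model, X_test, y_test):
    """Hypothetical stand-in for model.evaluate_model: return a metrics dict."""
    y_pred = model.predict(X_test)
    # Convert to plain floats so the dictionary can be serialized with json.
    return {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "precision": float(precision_score(y_test, y_pred)),
        "recall": float(recall_score(y_test, y_pred)),
        "f1_score": float(f1_score(y_test, y_pred)),
    }


# Synthetic binary data replaces the weather dataset in this sketch.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# With the default test_size of 0.25, 50 of the 200 rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1993)

model = train_model(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
print(metrics)
```

The call order mirrors main() above: the model is fitted only on the training split, and every metric is computed on the held-out test split.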

## Set up model training using CML

In this exercise, you will use the CML GitHub Action in a workflow that trains a Random Forest classifier to predict rainfall. CML (Continuous Machine Learning) is an open-source tool, available as a GitHub Action, that simplifies generating reports for ML experiments.

Training is triggered when you open a pull request (PR) against the main branch. You'll continue working with the weather dataset; as before, preprocess_dataset.py contains helper functions for pre-processing the dataset.

Running train.py produces two outputs: metrics.json, which contains the model metrics, and confusion_matrix.png, which contains a plot of the confusion matrix.
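
The metrics.json file is written by save_metrics, which lives in metrics_and_plots.py and is not shown here. A minimal sketch of how such a helper could work, assuming it simply serializes the metrics dictionary with the standard json module, is given below. The path parameter and the metric values are placeholders introduced for this example.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, path="metrics.json"):
    """Hypothetical stand-in for metrics_and_plots.save_metrics."""
    with open(path, "w") as fp:
        json.dump(metrics, fp, indent=2)


# Placeholder values, used only to show the file format.
example_metrics = {"accuracy": 0.9, "f1_score": 0.8}

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = Path(tmp_dir) / "metrics.json"
    save_metrics(example_metrics, metrics_path)
    saved_text = metrics_path.read_text()
    loaded_metrics = json.loads(saved_text)

print(saved_text)
```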

Your task is to complete the scaffolded .github/workflows/train_cml.yaml so that it defines the high-level model training workflow.

NOTE: Use python3 instead of python to run Python scripts.

### IDE Exercise Instructions

- Set up the CML GitHub Action, iterative/setup-cml@v1.
- Add the evaluation metrics data, metrics.json, to the markdown report in the Write CML report step.
- Add the confusion matrix plot, confusion_matrix.png, to the markdown report in the Write CML report step.
- Write the correct cml comment subcommand to create a comment in the PR.
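
To make the contents of the markdown report concrete, here is a sketch in Python of how a report could be assembled from metrics.json and confusion_matrix.png. In the workflow itself, the Write CML report step would typically do this with a few shell commands and then post the file to the PR (in recent CML releases, with cml comment create report.md). The function name write_report, the report layout, and the metric values below are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_report(metrics_path, plot_name, report_path):
    """Assemble a markdown report from a metrics file and a plot file name."""
    metrics = json.loads(Path(metrics_path).read_text())

    # Render the metrics dictionary as a small markdown table.
    lines = ["## Model metrics", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.3f} |")

    # Reference the plot with a relative markdown image link.
    lines += ["", "## Confusion matrix", "", f"![Confusion matrix](./{plot_name})"]
    Path(report_path).write_text("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder metrics standing in for the output of train.py.
    metrics_path = Path(tmp_dir) / "metrics.json"
    metrics_path.write_text(json.dumps({"accuracy": 0.9, "f1_score": 0.8}))

    report_path = Path(tmp_dir) / "report.md"
    write_report(metrics_path, "confusion_matrix.png", report_path)
    report_text = report_path.read_text()

print(report_text)
```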