## Model Training

Let's develop the training code.

In [1]:
import os
os.chdir("../")

Let's first define our standard entity configuration class. Note that it accepts `available_models`, a list of the names of the models we want to train. Currently only logistic regression, random forest, and SVM are supported; adding more requires changing the `_make_estimators` method below.

In [2]:
from dataclasses import dataclass
from pathlib import Path
from typing import List

@dataclass
class ModelTrainerConfig:
    """
    Configuration class for model training operations.

    This class contains all the parameters and paths needed for training
    machine learning models.
    """
    root_dir: Path
    train_data_path: Path
    test_data_path: Path
    model_name: str
    target_column: str
    cross_validation: int
    scoring: str
    available_models: List[str]
    params: dict
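
Because `available_models` is a plain list of strings, a misspelled or unsupported name would only surface later, during training. The sketch below shows one way to reject such names as soon as the configuration object is created. `ValidatedModelList` and `SUPPORTED_MODELS` are hypothetical names used for illustration and are not part of the project; the same `__post_init__` check could be added to `ModelTrainerConfig` itself.

```python
from dataclasses import dataclass
from typing import List

# Assumption: these are the only model names the trainer knows how to build.
SUPPORTED_MODELS = {"logistic_regression", "random_forest", "svm"}


@dataclass
class ValidatedModelList:
    """Hypothetical helper: holds model names and rejects unsupported ones."""
    available_models: List[str]

    def __post_init__(self):
        # Collect every name that is not in the supported set
        unknown = sorted(set(self.available_models) - SUPPORTED_MODELS)
        if unknown:
            raise ValueError(f"Unsupported models in config: {unknown}")
```

With this check in place, `ValidatedModelList(["svm", "decision_tree"])` raises a `ValueError` that names `decision_tree`, instead of the model being skipped without notice.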


In [3]:
from src.datascience.constants import CONFIG_FILE_PATH, PARAMS_FILE_PATH, SCHEMA_FILE_PATH
from src.datascience.utils.common import read_yaml, create_directories
from src.datascience import logger

class ConfigurationManager:
    """
    Configuration manager for handling YAML configuration files.
    
    This class loads configuration, parameters, and schema files and provides
    methods to retrieve specific configuration objects.
    """
    def __init__(
        self,
        config_filepath=CONFIG_FILE_PATH,
        params_filepath=PARAMS_FILE_PATH,
        schema_filepath=SCHEMA_FILE_PATH,
    ):
        """
        Initialize the ConfigurationManager.
        
        Args:
            config_filepath (Path): Path to the main configuration file
            params_filepath (Path): Path to the parameters file
            schema_filepath (Path): Path to the schema file
        """
          
        try:
            self.config = read_yaml(config_filepath)
            self.params = read_yaml(params_filepath)
            self.schema = read_yaml(schema_filepath)
            
            # Create artifacts root directory
            create_directories([self.config.artifacts_root])
            logger.info("ConfigurationManager initialized successfully")
            
        except Exception as e:
            logger.error(f"Error initializing ConfigurationManager: {e}")
            raise
    
    def get_model_trainer_config(self) -> ModelTrainerConfig:
        """
        Build the ModelTrainerConfig from the loaded configuration and parameters.

        Returns:
            ModelTrainerConfig: Paths and settings for the model training stage
        """
        config = self.config.model_trainer
        params = self.params

        create_directories([config.root_dir])

        model_trainer_config = ModelTrainerConfig(
            root_dir=config.root_dir,
            train_data_path=config.train_data_path,
            test_data_path=config.test_data_path,
            model_name=config.model_name,
            cross_validation=int(config.cross_validation),
            scoring=config.scoring,
            available_models=config.available_models,
            target_column=config.target_column,
            params=params,
        )
        return model_trainer_config

[2025-08-09 18:28:00,094: INFO: __init__: Logger initialized for the datascience package.]


The class below contains the core of our logic. The `__init_mlflow` method points MLflow at our tracking server and selects the experiment. I am currently using DagsHub, which is why a username and password are also read from the environment.
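
Since the credentials come from the environment (loaded from a `.env` file by `load_dotenv()`), it can help to check for them before contacting the server. The following is a minimal sketch of such a check, assuming the two variable names used in this notebook; `missing_env_vars` and `check_mlflow_env` are hypothetical helpers, not part of MLflow or of the project.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the notebook; the tracking URI is left out because
# the code falls back to a localhost default for it.
REQUIRED_VARS = ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


def check_mlflow_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise a readable error if any required credential is missing."""
    missing = missing_env_vars(REQUIRED_VARS, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `check_mlflow_env()` at the start of training would turn a missing credential into a single clear error message rather than an authentication failure partway through the run.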

The `_make_estimators` method reads the list of model names we want to train from the configuration and builds the actual estimator objects that represent those models.
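
One possible alternative to a chain of `if`/`elif` branches is a registry that maps each name to a factory function, so that supporting a new model means adding one dictionary entry. The sketch below illustrates the idea with the same three estimators and constructor arguments used in this notebook; `ESTIMATOR_FACTORIES` and `build_estimators` are hypothetical names, and this is not how the class below is actually written.

```python
from typing import Callable, Dict, List

from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical registry: model name -> zero-argument factory for a fresh estimator
ESTIMATOR_FACTORIES: Dict[str, Callable[[], BaseEstimator]] = {
    "logistic_regression": lambda: LogisticRegression(max_iter=2000),
    "random_forest": lambda: RandomForestClassifier(random_state=42, n_jobs=-1),
    "svm": lambda: SVC(probability=True),
}


def build_estimators(names: List[str]) -> Dict[str, BaseEstimator]:
    """Build one estimator per requested name, failing on unknown names."""
    unknown = [name for name in names if name not in ESTIMATOR_FACTORIES]
    if unknown:
        raise ValueError(f"Unsupported model names: {unknown}")
    return {name: ESTIMATOR_FACTORIES[name]() for name in names}
```

For example, `build_estimators(["svm", "logistic_regression"])` returns a dictionary with two unfitted estimators keyed by name, which is the same shape as `self.estimators` in the class below.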

The `train` method trains the estimators built by `_make_estimators` using grid search with cross-validation. For each model it logs the parameters and metrics to MLflow and saves the related artifacts.
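
To get a feel for how much work grid search does, recall that it fits every combination of hyperparameter values once per cross-validation fold. The total of 540 fits reported in the warning in the run output is this kind of product. The sketch below computes it with the standard library only; the grid and the fold count are made-up values for illustration, not the contents of `params.yaml`. Note that `len(param_grid)` on a dictionary grid counts the hyperparameter names, not the number of combinations.

```python
from itertools import product
from math import prod


def grid_combinations(param_grid):
    """Expand a dict of value lists into the list of all parameter combinations."""
    keys = list(param_grid)
    return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]


def total_fits(param_grid, cv_folds):
    """Number of model fits: combinations times folds (the final refit is not counted)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv_folds


# Hypothetical grid, for illustration only
example_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

print(len(example_grid))                      # 2 hyperparameter names
print(len(grid_combinations(example_grid)))   # 6 combinations (3 * 2)
print(total_fits(example_grid, cv_folds=5))   # 30 fits (6 * 5)
```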

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from src.datascience import logger
import joblib
from typing import List
from pathlib import Path
import mlflow
import mlflow.sklearn
import os
from dotenv import load_dotenv
import json
load_dotenv()

EXPERIMENT_NAME = "rain-prediction"


class ModelTrainer:
    """
    Component for training machine learning models.
    """
    def __init__(self, config: ModelTrainerConfig):
        self.config = config
        self.estimators = {}

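    def __init_mlflow(self):
        # Authentication: load_dotenv() has already copied any values from .env
        # into os.environ, so we only check that the credentials are present.
        # (Assigning os.getenv(...) back into os.environ raises a TypeError
        # when the variable is missing.)
        for var in ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD"):
            if not os.getenv(var):
                logger.warning(f"{var} is not set; the MLflow server may reject requests")
        mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000"))
        mlflow.set_experiment(EXPERIMENT_NAME)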

    def _make_estimators(self):
        """
        Populate self.estimators with one estimator object per configured model name.

        Raises:
            ValueError: If a name in available_models is not supported
        """
        for name in self.config.available_models:
            if name == "logistic_regression":
                self.estimators[name] = LogisticRegression(max_iter=2000)
            elif name == "random_forest":
                self.estimators[name] = RandomForestClassifier(random_state=42, n_jobs=-1)
            elif name == "svm":
                # probability=True so we can use predict_proba for ROC AUC
                self.estimators[name] = SVC(probability=True)
            else:
                # Fail loudly instead of silently skipping a misspelled name
                raise ValueError(f"Unsupported model name: {name!r}")
        

    def train(self):
        """
        Trains several models and saves them.
        
        Raises:
            FileNotFoundError: If training or testing data files don't exist
            Exception: If there's an error during training
        """
        try:

            # Set mlflow
            self.__init_mlflow()

            # load data
            train_data = pd.read_csv(self.config.train_data_path)

            train_x = train_data.drop([self.config.target_column], axis=1)
            train_y = train_data[self.config.target_column].astype(int)

            # Obtain estimators
            self._make_estimators()

            # Define Constants
            leaderboard = [] # This will hold a leaderboard of the models we will train
            best_name, best_est, best_score  = None, None, float("-inf")
            best_run_id = ""

            # Use GridSearchCV to train each model and log the results to MLflow
            for name, estimator in self.estimators.items():
                logger.info(f"Training {name} with GridSearchCV")

                param_grid = self.config.params.model_params.get(name, {})
                grid_search = GridSearchCV(
                    estimator=estimator,
                    param_grid=param_grid,
                    cv=self.config.cross_validation,
                    scoring=self.config.scoring,
                    n_jobs=-1
                )

                with mlflow.start_run(run_name=f"train:{name}") as run:
                    mlflow.log_param("model_name", name)
                    mlflow.log_param("cv_folds", int(self.config.cross_validation))
                    mlflow.log_param("scoring", self.config.scoring)
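                    # Note: for a dict grid this is the number of hyperparameters being tuned,
                    # not the number of parameter combinations
                    mlflow.log_param("grid_size", len(param_grid) if isinstance(param_grid, dict) else 0)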

                    grid_search.fit(train_x, train_y)

                    # Log the best params and CV score
                    mlflow.log_metric("cv_best_score", float(grid_search.best_score_))
                    for param,value in grid_search.best_params_.items():
                        mlflow.log_param(f"best_{param}", value)

                    # Let's save the full cv_results as an artifact
                    cv_results_path = os.path.join(self.config.root_dir, f"{name}_cv_results.json")
                    with open(cv_results_path, "w") as f:
                        json.dump(grid_search.cv_results_, f, default=str, indent=2)
                    mlflow.log_artifact(str(cv_results_path))

                    # Log the best estimator for this model 
                    mlflow.sklearn.log_model(grid_search.best_estimator_, artifact_path=name)

                    leaderboard.append({
                        "model": name,
                        "cv_best_score": float(grid_search.best_score_),
                        "best_params": grid_search.best_params_
                    })

                    # global best tracking 
                    if grid_search.best_score_ > best_score:
                        best_score = grid_search.best_score_
                        best_name, best_est = name, grid_search.best_estimator_
                        best_run_id = run.info.run_id



                    logger.info(f"{name} best params: {grid_search.best_params_}")
                    logger.info(f"{name} best CV score: {grid_search.best_score_:.4f}")

         
            # Save best model locally
            model_path = os.path.join(self.config.root_dir, self.config.model_name)
            joblib.dump(best_est, model_path)

            # Save leaderboard
            leaderboard_path = os.path.join(self.config.root_dir, "cv_leaderboard.json")
            Path(leaderboard_path).write_text(json.dumps(leaderboard, indent=2))

            # Record the MLflow run_id of the best model's training run
            run_id_path = os.path.join(self.config.root_dir, "train_run_id.txt")
            Path(run_id_path).write_text(best_run_id)

            logger.info(f"Training complete. Best model: '{best_name}' (CV={best_score:.4f})")
            logger.info(f"Saved best model to {model_path}")
            logger.info(f"Saved leaderboard to {leaderboard_path}")
            logger.info(f"Saved train run id to {run_id_path}")


        except FileNotFoundError as e:
            logger.error(f"Data file not found: {e}")
            raise
        except Exception as e:
            logger.error(f"Error during model training: {e}")
            raise
        

In [5]:
# No try/except here: train() already logs any error before re-raising it
config = ConfigurationManager()
model_trainer_config = config.get_model_trainer_config()
model_trainer = ModelTrainer(config=model_trainer_config)
model_trainer.train()

[2025-08-09 18:28:01,047: INFO: common: YAML file: config/config.yaml loaded successfully]
[2025-08-09 18:28:01,050: INFO: common: YAML file: params.yaml loaded successfully]
[2025-08-09 18:28:01,051: INFO: common: YAML file: schema.yaml loaded successfully]
[2025-08-09 18:28:01,052: INFO: common: Created directory at artifacts]
[2025-08-09 18:28:01,052: INFO: 1379764818: ConfigurationManager initialized successfully]
[2025-08-09 18:28:01,117: INFO: common: Created directory at artifacts/model_trainer]


2025/08/09 18:28:01 INFO mlflow.tracking.fluent: Experiment with name 'rain-prediction' does not exist. Creating a new experiment.


[2025-08-09 18:28:01,899: INFO: 950285947: Training logistic_regression with GridSearchCV]


90 fits failed out of a total of 540.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/home/armando-albornoz/Desktop/ml/MLOPS_course/project1/datascienceendtoend1/env/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/armando-albornoz/Desktop/ml/MLOPS_course/project1/datascienceendtoend1/env/lib/python3.11/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/armando-albornoz/Desktop/ml/MLOPS_course/project1/datascienceendtoend1/env/l

[2025-08-09 18:28:15,677: INFO: 950285947: logistic_regression best params: {'C': 0.1, 'l1_ratio': 0.1, 'max_iter': 2000, 'penalty': 'l2', 'solver': 'saga'}]
[2025-08-09 18:28:15,678: INFO: 950285947: logistic_regression best CV score: 0.9064]
[2025-08-09 18:28:15,799: INFO: 950285947: Training random_forest with GridSearchCV]
[2025-08-09 18:29:47,203: INFO: 950285947: random_forest best params: {'class_weight': 'balanced', 'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200, 'n_jobs': -1}]
[2025-08-09 18:29:47,206: INFO: 950285947: random_forest best CV score: 0.9201]
[2025-08-09 18:29:47,327: INFO: 950285947: Training svm with GridSearchCV]
[2025-08-09 18:30:06,369: INFO: 950285947: svm best params: {'C': 100, 'class_weight': 'balanced', 'gamma': 0.01, 'kernel': 'rbf', 'probability': True}]
[2025-08-09 18:30:06,370: INFO: 950285947: svm best CV score: 0.9172]
[2025-08-09 18:30:07,425: INFO: 950285947: Training complete. Best model: 'random_forest' (CV=