## **Problem Statement**

### **Business Context**

An automobile dealership in Los Vegas specializes in selling luxury and non-luxury vehicles. They cater to diverse customer preferences with varying vehicle specifications, such as mileage, engine capacity, and seating capacity. However, the dealership faces significant challenges in maintaining consistency and efficiency across its pricing strategy due to reliance on manual processes and disconnected systems. Pricing evaluations are prone to errors, updates are delayed, and scaling operations are difficult as demand grows. These inefficiencies impact revenue and customer trust. Recognizing the need for a reliable and scalable solution, the dealership is seeking to implement a unified system that ensures seamless integration of data-driven pricing decisions, adaptability to changing market conditions, and operational efficiency.

### **Objective**

The dealership has hired you as an MLOps Engineer to design and implement an MLOps pipeline that automates the pricing workflow. This pipeline will encompass data cleaning, preprocessing, transformation, model building, training, evaluation, and registration with CI/CD capabilities to ensure continuous integration and delivery. Your role is to overcome challenges such as integrating disparate data sources, maintaining consistent model performance, and enabling scalable, automated updates to meet evolving business needs. The expected outcomes are a robust, automated system that improves pricing accuracy, operational efficiency, and scalability, driving increased profitability and customer satisfaction.

### **Data Description**

The dataset contains attributes of used cars sold in various locations. These attributes serve as key data points for CarOnSell's pricing model. The detailed attributes are:

- **Segment:** Describes the category of the vehicle, indicating whether it is a luxury or non-luxury segment.

- **Kilometers_Driven:** The total number of kilometers the vehicle has been driven.

- **Mileage:** The fuel efficiency of the vehicle, measured in kilometers per liter (km/l).

- **Engine:** The engine capacity of the vehicle, measured in cubic centimeters (cc). 

- **Power:** The power of the vehicle's engine, measured in brake horsepower (BHP). 

- **Seats:** The number of seats in the vehicle, can influence the vehicle's classification, usage, and pricing based on customer needs.

- **Price:** The price of the vehicle, listed in lakhs (units of 100,000), represents the cost to the consumer for purchasing the vehicle.

## **Please read the instructions carefully before starting the project.**

This is a commented Python Notebook file in which all the instructions and tasks to be performed are mentioned. 
* Blanks '_______' are provided in the notebook that 
needs to be filled with an appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in the blank space. 
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill the code wherever required. Running incomplete code may throw error.
* Please run the codes in a sequential manner from the beginning to avoid any unnecessary errors.

## **1. AzureML Environment Setup and Data Preparation**

### **1.1 Connect to Azure Machine Learning Workspace**

In [3]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

In [1]:
%pip install python-dotenv azure-identity

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Load environment variables from .env file
load_dotenv()

# `DefaultAzureCredential` will automatically pick up the AZURE_CLIENT_ID,
# AZURE_TENANT_ID, and AZURE_CLIENT_SECRET environment variables.
credential = DefaultAzureCredential()

# You can now use this credential to authenticate with any Azure service client.
# For example, let's list the resource groups in your subscription.

# Get the subscription ID from the environment variables
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")

if not subscription_id:
    raise ValueError("AZURE_SUBSCRIPTION_ID not set in .env file")

# Create a client to interact with Azure resources
resource_client = ResourceManagementClient(credential, subscription_id)

# List resource groups
print(f"Resource groups in subscription {subscription_id}:")
for group in resource_client.resource_groups.list():
    print(f"- {group.name}")

ModuleNotFoundError: No module named 'azure.mgmt.resource'

In [None]:
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=os.getenv("SUBSCRIPTION_ID"),
    resource_group_name=os.getenv("AZURE_RESOURCE_GROUP"),
    workspace_name=os.getenv("AZURE_WORKSPACE"),
)

### **1.2 Set Up Compute Cluster**

In [3]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="Standard_DS11_v2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=1,
        # How many seconds will the node running after the job termination
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

You already have a cluster named cpu-cluster, we'll reuse it as is.
AMLCompute with name cpu-cluster is created, the compute size is Standard_DS11_v2


### **1.3 Register Dataset as Data Asset**

In [4]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Path to the local dataset
local_data_path = 'used_cars.csv'

# Create and register the dataset as an AzureML data asset
data_asset = Data(
    path=local_data_path,
    type=AssetTypes.URI_FILE, 
    description="A dataset of used cars for price prediction",
    name="used-cars-data"
)

In [5]:
ml_client.data.create_or_update(data_asset)

Data({'path': 'azureml://subscriptions/6490c64b-602a-4887-b258-36064f4cb8d4/resourcegroups/default_resourse_group/workspaces/demo_workspace/datastores/workspaceblobstore/paths/LocalUpload/2be82e6311791c3eb0847ecab5279e37/used_cars.csv', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'used-cars-data', 'description': 'A dataset of used cars for price prediction', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/6490c64b-602a-4887-b258-36064f4cb8d4/resourceGroups/default_resourse_group/providers/Microsoft.MachineLearningServices/workspaces/demo_workspace/data/used-cars-data/versions/8', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/c002/code/Users/TESTP3XHV8C5OT_1734342789061', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fc05c4dd840>, 'seria

### **1.4 Create and Configure Job Environment**

In [6]:
# Create a directory for the preprocessing script
import os

src_dir_env = "./env"
os.makedirs(src_dir_env, exist_ok=True)

In [7]:
%%writefile {src_dir_env}/conda.yml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.23.2
  - scipy=1.7.1
  - pip:  
    - mlflow==2.8.1
    - azureml-mlflow==1.51.0
    - azureml-inference-server-http
    - azureml-core==1.49.0
    - cloudpickle==1.6.0

Overwriting ./env/conda.yml


In [8]:
from azure.ai.ml.entities import Environment, BuildContext

env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="env/conda.yml",
    name="machine_learning_E2E",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env_docker_conda)

Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': 'mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04', 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'machine_learning_E2E', 'description': 'Environment created from a Docker image plus Conda environment.', 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/6490c64b-602a-4887-b258-36064f4cb8d4/resourceGroups/default_resourse_group/providers/Microsoft.MachineLearningServices/workspaces/demo_workspace/environments/machine_learning_E2E/versions/6', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/c002/code/Users/TESTP3XHV8C5OT_1734342789061', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fc05c4c57e0>, 'serialize': <msrest.serialization.Serializer object at 0x7fc05c4c4e80>, 'version': '6', 'conda_file': {'chann

## **2. Model Development Workflow**

### **2.1 Data Preparation**

This **Data Preparation job** is designed to process an input dataset by splitting it into two parts: one for training the model and the other for testing it. The script accepts three inputs: the location of the input data (`used_cars.csv`), the ratio for splitting the data into training and testing sets (`test_train_ratio`), and the paths to save the resulting training (`train_data`) and testing (`test_data`) data. The script first reads the input CSV data from a data asset URI, then splits it using Scikit-learn's train_test_split function, and saves the two parts to the specified directories. It also logs the number of records in both the training and testing datasets using MLflow.

In [9]:
# Create a directory for the preprocessing script
import os

src_dir_job_scripts = "./data_prep"
os.makedirs(src_dir_job_scripts, exist_ok=True)

In [10]:
%%writefile {src_dir_job_scripts}/data_prep.py

import os
import argparse
import logging
import mlflow
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="Path to input data")
    parser.add_argument("--test_train_ratio", type=float, default=0.2)
    parser.add_argument("--train_data", type=str, help="Path to save train data")
    parser.add_argument("--test_data", type=str, help="Path to save test data")
    args = parser.parse_args()

    # Start MLflow Run
    mlflow.start_run()

    # Log arguments
    logging.info(f"Input data path: {args.data}")
    logging.info(f"Test-train ratio: {args.test_train_ratio}")

    # Reading Data
    df = pd.read_csv(args.raw_data)

    # Encode categorical feature
    le = LabelEncoder()
    df['_______'] = le.fit_transform(df['_______'])  # Write code to encode the categorical feature

    # Split Data into train and test datasets
    train_df, test_df = train_test_split(df, test_size=args.________, random_state=42)  #  Write code to split the data into train and test datasets

    # Save train and test data
    os.makedirs(args.________, exist_ok=True)  # Create directories for train_data and test_data
    os.makedirs(args.________, exist_ok=True)  # Create directories for train_data and test_data
    train_df.to_csv(os.path.join(args.train_data, "________.csv"), index=False)  # Specify the name of the train data file
    test_df.to_csv(os.path.join(args.test_data, "________.csv"), index=False)  # Specify the name of the test data file

    # log the metrics
    mlflow.log_metric('train size', train_df.shape[__])  # Log the train dataset size
    mlflow.log_metric('test size', test_df.shape[__])  # Log the test dataset size
    
    mlflow.end_run()

if __name__ == "__main__":
    main()

Overwriting ./data_prep/data_prep.py


#### **Define Data Preparation job**

For this AzureML job, we define the `command` object that takes input files and output directories, then executes the script with the provided inputs and outputs. The job runs in a pre-configured AzureML environment with the necessary libraries. The result will be two separate datasets for training and testing, ready for use in subsequent steps of the machine learning pipeline.

In [None]:
from azure.ai.ml import command, Input, Output

step_process = command(
    name="data_preparation",  # Specify the name of the job
    display_name="__________________",  # Provide a display name for the job
    description="_______________________",  # Provide a description of the job
    inputs={ 
        "data": Input(type="________"),  # Specify input type for data (e.g., file URI)
        "test_train_ratio": Input(type="__________"),  # Specify input type for the test/train ratio (float)
    },
    outputs={  
        "train_data": Output(type="________", mode="rw_mount"),  # Specify output type for train data (e.g., folder URI)
        "test_data": Output(type="_______", mode="rw_mount"),  # Specify output type for test data (e.g., folder URI)
    },
    code="./data_prep",  # Path to the data preparation script
    command="""python ________ 
            --data ${{inputs.data}} \  
            --test_train_ratio ${{inputs.test_train_ratio}} \  
            --train_data ${{outputs.train_data}} \  
            --test_data ${{outputs.________}}""",  # Specify the output for test data
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # Provide the environment name for execution
)


### **2.2 Training the Model**

This Model Training job is designed to train a **Random Forest Regressor** on the dataset that was split into training and testing sets in the previous data preparation job. This job script accepts five inputs: the path to the training data (`train_data`), the path to the testing data (`test_data`), the number of trees in the forest (`n_estimators`, with a default value of 100), the maximum depth of the trees (`max_depth`, which is set to None by default), and the path to save the trained model (`model_output`).

The script begins by reading the training and testing data files, then processes the data to separate features (X) and target labels (y). A Random Forest Regressor model is initialized using the given n_estimators and max_depth, and it is trained using the training data. The model's performance is evaluated using the `Mean Squared Error (MSE)`. The MSE score is logged in MLflow. Finally, the trained model is saved and stored in the specified output location as an MLflow model. The job completes by logging the final MSE score and ending the MLflow run.


In [12]:
# Create a directory for the preprocessing script
import os

src_dir_job_scripts = "./model_train"
os.makedirs(src_dir_job_scripts, exist_ok=True)

In [13]:
%%writefile {src_dir_job_scripts}/model_train.py

# Required imports for training
import mlflow
import argparse
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

mlflow.start_run()  # Start the MLflow experiment run

os.makedirs("./outputs", exist_ok=True)  # Create the "outputs" directory if it doesn't exist

def select_first_file(path):
    """Selects the first file in a folder, assuming there's only one file.
    Args:
        path (str): Path to the directory or file to choose.
    Returns:
        str: Full path of the selected file.
    """
    files = os.listdir(path)
    return os.path.join(path, files[0])

def main():
    parser = argparse.ArgumentParser("train")
    parser.add_argument("--train_data", type=str, help="Path to train dataset")
    parser.add_argument("--test_data", type=str, help="Path to test dataset")
    parser.add_argument("--model_output", type=str, help="Path of output model")
    parser.add_argument('--n_estimators', type=int, default=100,
                        help='The number of trees in the forest')
    parser.add_argument('--max_depth', type=int, default=None,
                        help='The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.')

    args = parser.parse_args()

    # Load datasets
    train_df = pd.read_csv(select_first_file(args.train_data))
    test_df = pd.read_csv(select_first_file(args.test_data))

    # Split the data into ______(X) and ______(y) 
    y_train = train_df['______']  # Specify the target column
    X_train = train_df.drop(columns=['______'])
    y_test = test_df['______']
    X_test = test_df.drop(columns=['______'])

    # Initialize and train a RandomForest Regressor
    model = RandomForestRegressor(n_estimators=args.n_estimators, max_depth=args.______, random_state=42)  # Provide the arguments for RandomForestRegressor
    model.________(X_train, y_train)  # Train the model

    # Log model hyperparameters
    mlflow.log_param("model", "_________")  # Provide the model name
    mlflow.log_param("n_estimators", args.n_estimators)
    mlflow.log_param("max_depth", args.______)

    # Predict using the RandomForest Regressor on test data
    yhat_test = model._______(X_test)  # Predict the test data

    # Compute and log mean squared error for test data
    mse = mean_squared_error(y_test, yhat_test)
    print('Mean Squared Error of RandomForest Regressor on test set: {:.2f}'.format(mse))
    mlflow.log_metric("MSE", float(mse))  # Log the MSE

    # Save the model
    mlflow.sklearn.________(sk_model=model, path=args.model_output)  # Save the model

    mlflow.end_run()  # Ending the MLflow experiment run

if __name__ == "__main__":
    main()

Overwriting ./model_train/model_train.py


#### **Define Model Training Job**

For this AzureML job, we define the `command` object that takes the paths to the training and testing data, the number of trees in the forest (`n_estimators`), and the maximum depth of the trees (`max_depth`) as inputs, and outputs the trained model. The command runs in a pre-configured AzureML environment with all the necessary libraries. The job produces a trained **Random Forest Regressor model**, which can be used for predicting the price of used cars based on the given attributes.

In [14]:
from azure.ai.ml import command, Input, Output

train_step = command(
    name="train_price_prediction_model",  # Specify the name of the command step for model training
    display_name="________________________",  # Provide a descriptive display name
    description="_________________________",  # Description of the task to be performed
    inputs={  # Define inputs required for the training command
        "train_data": Input(type="__________"),  # Specify input type for train data (e.g., file URI)
        "test_data": Input(type="__________"),  # Specify input type for test data (e.g., file URI)
        "n_estimators": Input(type="number", default=____),  # Specify default value for number of estimators (trees in Random Forest)
        "max_depth": Input(type="number", default=___),  # Set default value for the maximum depth of the trees
    },
    outputs={  # Define the output of the training job
        "model_output": Output(type="mlflow_model"),  # Path to save the trained model
    },
    code="________",  # Fill in the directory where the training script (model_train.py) is located
    command="""python ________ 
            --train_data ${{inputs.train_data}} \ 
            --test_data ${{inputs.test_data}} \ 
            --n_estimators ${{inputs.n_estimators}} \ 
            --max_depth ${{inputs.max_depth}} \ 
            --model_output ${{outputs.model_output}}""",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="________",  # Specify the compute target (e.g., "cpu-cluster")
)


### **2.3 Registering the Best Trained Model**

The **Model Registration job** is designed to take the best-trained model from the hyperparameter tuning sweep job and register it in MLflow as a versioned artifact for future use in the used car price prediction pipeline. This job script accepts one input: the path to the trained model (model). The script begins by loading the model using the `mlflow.sklearn.load_model()` function. Afterward, it registers the model in the MLflow model registry, assigning it a descriptive name (`used_cars_price_prediction_model`) and specifying an artifact path (`random_forest_price_regressor`) where the model artifacts will be stored. Using MLflow's `log_model()` function, the model is logged along with its metadata, ensuring that the model is easily trackable and retrievable for future evaluation, deployment, or retraining.

In [23]:
# Create a directory for the preprocessing script
import os

src_dir_job_scripts = "./model_register"
os.makedirs(src_dir_job_scripts, exist_ok=True)

In [24]:
%%writefile {src_dir_job_scripts}/model_register.py

import os
import argparse
import logging
import mlflow
import pandas as pd
from pathlib import Path

mlflow.start_run()  # Starting the MLflow experiment run

def main():
    # Argument parser setup for command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, help="Path to the trained model")  # Path to the trained model artifact
    args = parser.parse_args()

    # Load the trained model from the provided path
    model = mlflow.sklearn.load_model(args.model)  # _______ (Fill the code to load model from args.model)

    print("Registering the best trained used cars price prediction model")
    
    # Register the model in the MLflow Model Registry under the name "price_prediction_model"
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name="__________",  # Specify the name under which the model will be registered
        artifact_path="__________"  # Specify the path where the model artifacts will be stored
    )

    # End the MLflow run
    mlflow.end_run()  # ________ (Fill in the code to end the MLflow run)

if __name__ == "__main__":
    main()

Overwriting ./model_register/model_register.py


#### **Define Model Register Job**

For this AzureML job, a `command` object is defined to execute the `model_register.py` script. It accepts the best-trained model as input, runs the script in the `AzureML-sklearn-1.0-ubuntu20.04-py38-cpu` environment, and uses the same compute cluster as the previous jobs (`cpu-cluster`). This job plays a crucial role in the pipeline by ensuring that the best-performing model identified during hyperparameter tuning is systematically stored and made available in the MLflow registry for further evaluation, deployment, or retraining. Integrating this job into the end-to-end pipeline automates the process of registering high-quality models, completing the model development lifecycle and enabling the prediction of used car prices.

In [None]:
from azure.ai.ml import command, Input

model_register_component = command(
    name="register_model", 
    display_name="_____________",  # Provide a descriptive display name for the step
    description="_______________________",  # Describe the task
    inputs={  # Define inputs required for the model registration command
        "model": Input(type="mlflow_model"), 
    },
    code="________",  # Fill in the directory where the model register script (model_register.py) is located
    command="""python  model_register.py \ 
            --model ${{inputs.model}}""",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # Environment configuration for the model register job
    compute="cpu-cluster",  # Specify the compute target to be used for the job
)


### **2.4. Assembling the End-to-End Workflow**

The end-to-end pipeline integrates all the previously defined jobs into a seamless workflow, automating the process of data preparation, model training, hyperparameter tuning, and model registration. The pipeline is designed using Azure Machine Learning's `@pipeline` decorator, specifying the compute target and providing a detailed description of the workflow.

In [None]:
from azure.ai.ml.sweep import Choice
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import ModelType
from azure.ai.ml.dsl import pipeline

# Assemble the pipeline by chaining the jobs
@pipeline(
    compute="cpu-cluster",  # Compute target for the pipeline
    description="______________________",  # Provide a description for the pipeline
)
def complete_pipeline(input_data_uri, test_train_ratio, n_estimators, max_depth):
    
    # Step 1: Preprocess the data
    preprocess_step = step_process(
        data=input_data_uri,  # Input URI for data
        test_train_ratio=test_train_ratio,  # Input for the train-test split ratio (e.g., 0.8)
    )
    
    # Step 2: Train the model using preprocessed data
    training_step = train_step(
        train_data=preprocess_step.outputs.train_data,  # Output from data preprocessing (training data)
        test_data=preprocess_step.outputs.test_data,  # Output from data preprocessing (testing data)
        n_estimators=n_estimators,  # Input for the number of estimators (e.g., 50)
        max_depth=max_depth,  # Input for the maximum depth of trees (e.g., 10)
    )
    
    # Define the training step with hyperparameters for tuning
    job_for_sweep = training_step(
        n_estimators=Choice(values=[10, 20, 30, 50]),
        max_depth=Choice(values=___________________),  # List of possible values for max_depth
    )

    # Define the sweep job
    sweep_job = job_for_sweep.sweep(
        compute="cpu-cluster",
        sampling_algorithm="random",
        primary_metric="MSE",
        goal="Minimize",
    )

    # Set the limits for the sweep job:
    # - max_total_trials: The maximum number of hyperparameter combinations to be evaluated (20 in this case).
    # - max_concurrent_trials: The maximum number of trials to run simultaneously (10 in this case) to optimize resource utilization.
    # - timeout: The maximum allowed duration for the sweep job in seconds (7200 seconds, or 2 hours).
    sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=10, timeout=7200)
    
    # Step 3: Register the best model
    # After the sweep job, get the best model
    model_register_step = model_register_component(
        model=job_for_sweep.outputs.model_output,  # Output from sweep job (best model)
    )

    # Returning outputs from all steps in the pipeline
    return {
        "pipeline_job_train_data": preprocess_step.outputs.train_data,  # Output from preprocessing step (train data)
        "pipeline_job_test_data": preprocess_step.outputs.test_data,  # Output from preprocessing step (test data)
        "pipeline_job_best_model": job_for_sweep.outputs.model_output,  # Output from sweep job (best model)
    }


1. **Data Preparation (Preprocessing Step)**:
The pipeline starts by invoking the `step_process` job, which preprocesses the raw input data (`input_data_uri`). This step splits the dataset into training and testing sets based on the provided `test_train_ratio`. The outputs from this step include the processed training and testing datasets (`train_data` and `test_data`), which are passed as inputs to the next step.

2. **Model Training**:
The second step in the pipeline is the `train_step`, which trains a **Random Forest Regressor model** using the preprocessed training data. The job uses `train_data` and `test_data` from the preprocessing step and accepts hyperparameters like `n_estimators` and `max_depth` to configure the model. The training step is designed to work flexibly with the parameters defined in the pipeline, allowing experimentation. This step evaluates the model's performance using the `Mean Squared Error (MSE)` metric and logs the result in MLflow. The trained model is then saved and stored in the specified output location as an MLflow model.

3. **Hyperparameter Tuning**:
To optimize the model's performance, a Sweep Job is defined based on the training step. The sweep job explores multiple combinations of hyperparameters (`n_estimators` and `max_depth`) using a random sampling algorithm. It aims to minimize the model's `Mean Squared Error (MSE)` to ensure accurate price predictions. The job limits are set to allow a maximum of 20 trials, with up to 10 trials running concurrently, and a total timeout of 7200 seconds (2 hours). This step identifies the best combination of hyperparameters for the model.

4. **Model Registration**:
Once the sweep job completes, the best-performing model is passed to the `model_register_component`. This step registers the model in the MLflow model registry, ensuring that it is versioned and available for deployment or future experimentation. The registered model includes its metadata and is stored with a descriptive name (`used_cars_price_prediction_model`).

5. **Pipeline Outputs**:
The pipeline returns key outputs for further analysis, including the locations of the training and testing datasets and the best-trained model from the sweep job. These outputs ensure traceability and provide resources for subsequent tasks like evaluation and deployment.

The pipeline is instantiated by providing the required inputs, such as the data path, test-train ratio, and initial values for hyperparameters (`n_estimators` and `max_depth`). It is then submitted to Azure Machine Learning for execution under the experiment name `price_prediction_pipeline`. Real-time logs can be streamed to monitor the pipeline's progress. Once the pipeline completes, the outputs can be accessed for verification.

In [27]:
# The code retrieves a specific version of a registered data asset using the ml_client object.
data_path = ml_client.data.get("____________", version="1").path # Provide the name of the data asset

In [28]:
# Create pipeline instance
pipeline_instance = complete_pipeline(
    input_data_uri=Input(type="uri_file", path=data_path),  # Dataset path
    test_train_ratio=0.2,  # Test-train ratio
    n_estimators=50,       # Initial value for n_estimators
    max_depth=5             # Initial value for max depth
)

In [None]:
# Submit the pipeline to Azure ML
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_instance, 
    experiment_name="_________________________" # Provide the experiment name
)

# Stream the output of the job for real-time logs
ml_client.jobs.stream(pipeline_job.name)

In [None]:
# Access pipeline outputs (optional, after job completion)
print(f"Train data location: {pipeline_job.outputs['pipeline_job_train_data']}")
print(f"Test data location: {pipeline_job.outputs['pipeline_job_test_data']}")
print(f"Best model location: {pipeline_job.outputs['pipeline_job_best_model']}")