# Used Car Prices CarGurus on Azure: XGBoost

## Set up the workspace
Connect to the Azure Machine Learning workspace with `MLClient`.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
    resource_group_name='aschultzdata',
    workspace_name='ds-ml-env',
)

## Create a compute cluster to run the job
Check whether the cluster has already been created.

In [None]:
from azure.ai.ml.entities import AmlCompute

# Define the name of the compute cluster
gpu_compute_target = 'gpu-cluster-NC4as-T4-v3'

# Examine if the cluster already exists
gpu_cluster = ml_client.compute.get(gpu_compute_target)
print(f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is.")

You already have a cluster named gpu-cluster-NC4as-T4-v3, we'll reuse it as is.
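
The cell above only looks the cluster up, so it raises an error when no cluster with that name exists. A minimal sketch of a get-or-create pattern follows. The helper `get_or_create_cluster` and its `build_cluster` and `not_found` parameters are hypothetical names introduced for illustration, and the VM size and scaling values in the commented usage are assumptions rather than the settings used for this workspace.

```python
def get_or_create_cluster(ml_client, name, build_cluster, not_found=Exception):
    """Return the existing compute target, or create it when the lookup fails.

    build_cluster is a callable that takes the cluster name and returns the
    compute definition to create (hypothetical helper, not part of the SDK).
    """
    try:
        cluster = ml_client.compute.get(name)
        print(f"You already have a cluster named {name}, we'll reuse it as is.")
    except not_found:
        print(f"Cluster {name} was not found, creating it.")
        poller = ml_client.compute.begin_create_or_update(build_cluster(name))
        cluster = poller.result()  # block until provisioning finishes
    return cluster


# Hypothetical usage with the Azure ML SDK v2 (all values are assumptions):
#
# from azure.ai.ml.entities import AmlCompute
# from azure.core.exceptions import ResourceNotFoundError
#
# gpu_cluster = get_or_create_cluster(
#     ml_client,
#     gpu_compute_target,
#     lambda name: AmlCompute(
#         name=name,
#         size='Standard_NC4as_T4_v3',  # assumed VM size
#         min_instances=0,
#         max_instances=1,
#         idle_time_before_scale_down=180,
#     ),
#     not_found=ResourceNotFoundError,
# )
```

Setting `min_instances=0` in such a definition lets the cluster scale down to zero nodes when idle, which avoids paying for an unused GPU.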


## Create the job environment
The environment lists the components of the runtime and the libraries installed on the compute for training the model.

In [None]:
import os

dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)

Now we can write the `conda` file into the `dependencies` directory.

In [None]:
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=23.1.2
  - numpy=1.21.6
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow
    - azureml-mlflow
    - psutil==5.9.0
    - tqdm
    - ipykernel
    - matplotlib
    - xgboost==1.5.2
    - optuna==3.0.0
    - eli5==0.13.0
    - shap==0.41.0

Writing ./dependencies/conda.yaml


With the `conda.yaml` file written, the environment can be created and registered in the workspace.

In [None]:
from azure.ai.ml.entities import Environment

custom_env_name = 'aml-usedcars-gpu-xgboost'

custom_job_env = Environment(
    name=custom_env_name,
    description='Custom environment for Used Cars XGBoost job',
    tags={'xgboost': '1.5.2'},
    conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
    image='mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)

Environment with name aml-usedcars-gpu-xgboost is registered to workspace, the environment version is 2


## Create training script
First, create the source folder that will hold the training script, `main.py`.

In [None]:
import os

train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)

The training script prepares the environment, reads the data, preprocesses it, trains the model, evaluates it, and saves the model.
It imports the dependencies, sets the seed, defines the input/output arguments with `argparse`, reads the train/test sets, separates the features from the target, and creates dummy variables for the categorical features. The number of samples and the number of features are logged with `MLflow`. The script then trains an `XGBoost` model using the best parameters found with `Optuna` and logs the `mean absolute error`, `mean squared error`, `root mean squared error` and the `Coefficient of Determination` (R$^{2}$) for the train and test sets. Finally, the model is saved and registered.


In [None]:
%%writefile {train_src_dir}/main.py
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
import mlflow
import mlflow.xgboost
import joblib
from xgboost import XGBRegressor, plot_importance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['usedCars_xgbGPU'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

def main():
    """Main function of the script."""

    # Input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str, help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--n_estimators', required=False, default=100, type=int)
    parser.add_argument('--max_depth', required=False, default=6, type=int)
    parser.add_argument('--subsample', required=False, default=1, type=float)
    parser.add_argument('--gamma', required=False, default=0, type=float)
    parser.add_argument('--learning_rate', required=False, default=0.3, type=float)
    parser.add_argument('--reg_alpha', required=False, default=0, type=float)
    parser.add_argument('--reg_lambda', required=False, default=1, type=float)
    parser.add_argument('--colsample_bytree', required=False, default=1, type=float)
    parser.add_argument('--colsample_bylevel', required=False, default=1, type=float)
    parser.add_argument('--min_child_weight', required=False, default=1, type=int)
    parser.add_argument('--registered_model_name', type=str, help='model name')
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.xgboost.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print('input train data:', args.train_data)
    print('input test data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    # Separate the target (price) from the features
    train_label = trainDF[['price']]
    test_label = testDF[['price']]

    train_features = trainDF.drop(columns = ['price'])
    test_features = testDF.drop(columns = ['price'])

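    train_features = pd.get_dummies(train_features, drop_first=True)
    test_features = pd.get_dummies(test_features, drop_first=True)

    # Align the test columns with the train columns so both sets expose the same features
    test_features = test_features.reindex(columns=train_features.columns, fill_value=0)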

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    print(f"Training with data of shape {train_features.shape}")

    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    best_model = XGBRegressor(objective='reg:squarederror',
                              eval_metric='rmse',  # XGBoost names this parameter eval_metric
                              booster='gbtree',
                              tree_method='gpu_hist',
                              scale_pos_weight=1,
                              use_label_encoder=False,
                              random_state=42,
                              verbosity=0,
                              n_estimators=args.n_estimators,
                              max_depth=args.max_depth,
                              subsample=args.subsample,
                              gamma=args.gamma,
                              learning_rate=args.learning_rate,
                              reg_alpha=args.reg_alpha,
                              reg_lambda=args.reg_lambda,
                              colsample_bytree=args.colsample_bytree,
                              colsample_bylevel=args.colsample_bylevel,
                              min_child_weight=args.min_child_weight)

    best_model.fit(train_features, train_label)

    print('\nModel Metrics for Used Cars XGBoost')
    y_train_pred = best_model.predict(train_features)
    y_test_pred = best_model.predict(test_features)

    train_mae = mean_absolute_error(train_label, y_train_pred)
    test_mae = mean_absolute_error(test_label, y_test_pred)
    train_mse = mean_squared_error(train_label, y_train_pred)
    test_mse = mean_squared_error(test_label, y_test_pred)
    train_rmse = mean_squared_error(train_label, y_train_pred, squared=False)
    test_rmse = mean_squared_error(test_label, y_test_pred, squared=False)
    train_r2 = r2_score(train_label, y_train_pred)
    test_r2 = r2_score(test_label, y_test_pred)

    mlflow.log_metric('train_mae', train_mae)
    mlflow.log_metric('train_mse', train_mse)
    mlflow.log_metric('train_rmse', train_rmse)
    mlflow.log_metric('train_r2', train_r2)
    mlflow.log_metric('test_mae', test_mae)
    mlflow.log_metric('test_mse', test_mse)
    mlflow.log_metric('test_rmse', test_rmse)
    mlflow.log_metric('test_r2', test_r2)

    # Print the metrics computed above instead of recalculating them
    print('MAE train: %.3f, test: %.3f' % (train_mae, test_mae))
    print('MSE train: %.3f, test: %.3f' % (train_mse, test_mse))
    print('RMSE train: %.3f, test: %.3f' % (train_rmse, test_rmse))
    print('R^2 train: %.3f, test: %.3f' % (train_r2, test_r2))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print('Registering the model via MLFlow')
    mlflow.xgboost.log_model(
        xgb_model=best_model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.xgboost.save_model(
        xgb_model=best_model,
        path=os.path.join(args.registered_model_name, 'trained_model'),
    )
    ###########################
    #</save and register model>
    ###########################

    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

Writing ./src/main.py
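
To make the four logged metrics concrete, here is a small standard-library sketch of how they relate to one another. The function `regression_metrics` and the toy prices are hypothetical and only illustrate the definitions. The training script itself relies on `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error

    # R^2 = 1 - SS_res / SS_tot (undefined when every true value is identical)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2}


# Toy example with made-up prices
prices = [10000.0, 15000.0, 20000.0, 25000.0]
predicted = [11000.0, 14000.0, 21000.0, 24000.0]
metrics = regression_metrics(prices, predicted)
print(metrics)
```

RMSE is in the same units as the target (here, price), which usually makes it easier to interpret than MSE.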


In this script, once the model is trained, the model file is saved and registered to the workspace. Registering the model stores and versions it in the model registry. The registry tracks the trained models, which can then be evaluated for data drift and concept drift over time.



## Train the model with specified components
To train the model, a `command` job is configured with inputs that specify the input data and the hyperparameter values. When submitted, the job runs the training script on the specified compute resource and environment, and logs the specified parameters.

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = 'usedcars_xgb_model'

job = command(
    inputs=dict(
        train_data=Input(
            type='uri_file',
            path='azureml://datastores/used_cars_datastore/paths/usedCars_trainSet.csv',
        ),
        test_data=Input(
            type='uri_file',
            path='azureml://datastores/used_cars_datastore/paths/usedCars_testSet.csv',
        ),
        n_estimators=549,
        max_depth=14,
        subsample=0.7997070496461064,
        gamma=2.953865805049196e-05,
        learning_rate=0.04001808814037916,
        reg_alpha=0.018852758055925938,
        reg_lambda=1.8216639376033342e-06,
        colsample_bytree=0.56819205236003,
        colsample_bylevel=0.5683397007952175,
        min_child_weight=7,
        registered_model_name=registered_model_name,
    ),

    code='./src/',
    command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --subsample ${{inputs.subsample}} --gamma ${{inputs.gamma}} --learning_rate ${{inputs.learning_rate}} --reg_alpha ${{inputs.reg_alpha}} --reg_lambda ${{inputs.reg_lambda}} --colsample_bytree ${{inputs.colsample_bytree}} --colsample_bylevel ${{inputs.colsample_bylevel}} --min_child_weight ${{inputs.min_child_weight}} --registered_model_name ${{inputs.registered_model_name}}',
    environment='aml-usedcars-gpu-xgboost@latest',
    compute='gpu-cluster-NC4as-T4-v3',
    display_name='usedcars_xgb_default_prediction',
)
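
The `command` string must reference every input explicitly. An input that is defined in `inputs` but never referenced in the string does not reach the script, which then falls back to its `argparse` default. One way to guard against this is to generate the string from the input names. The helper `build_command` below is a hypothetical sketch, not part of the Azure ML SDK.

```python
def build_command(script, input_names):
    """Build 'python <script> --name ${{inputs.name}} ...' with one flag per input."""
    flags = ' '.join(
        '--' + name + ' ${{inputs.' + name + '}}' for name in input_names
    )
    return 'python ' + script + ' ' + flags


# Example with a subset of the job inputs (values taken from the cell above)
example_inputs = {
    'n_estimators': 549,
    'max_depth': 14,
    'min_child_weight': 7,
}
example_command = build_command('main.py', example_inputs.keys())
print(example_command)
```

With this approach, the same dictionary could be passed as `inputs=` and its keys used to build `command=`, so the two stay in sync as hyperparameters are added or removed.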

## Submit the job
The job can then be submitted to run in `Azure Machine Learning Studio` by calling the `create_or_update` method of `ml_client`.

In [None]:
ml_client.create_or_update(job)

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading src (0.01 MBs): 100%

| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| XGBoost | sleepy_chicken_d62tr37sk6 | command | Starting | Link to Azure Machine Learning studio |


## View Job Output
The submitted job can be viewed by selecting the link in the output of the previous cell. Once the job completes, the information logged with `MLflow`, including the model metrics, can be viewed and downloaded.